DNA: Volume 1 



PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. 
PDF generated at: Sun, 11 Apr 2010 14:52:52 UTC 



Contents 



Articles 

Copyright © 2009 by Bci2

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation, with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
copy of the license is included in the section entitled "GNU Free
Documentation License". Edited by Bci2, with all contributors listed after
the License statement.

DNA Basics

DNA

DNA sequence

DNA sequencing

DNA profiling

DNA polymerase

DNA Topoisomerase

List of nucleic acid simulation software

DNA Structure and Functions

DNA structure

Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid

Molecular models of DNA

DNA replication

DNA repair

DNA Translation

DNA Transcription

DNA Transfer

Reverse transcriptase

DNA microarray

Triple-stranded DNA

G-quadruplex

DNA Analysis

Human genome



References

Article Sources and Contributors
Image Sources, Licenses and Contributors

Article Licenses

License






Copyright © 2009 by Bci2






Permission is granted to copy, distribute
and/or modify this document under the terms
of the GNU Free Documentation License,
Version 1.2 or any later version published
by the Free Software Foundation, with no
Invariant Sections, no Front-Cover Texts,
and no Back-Cover Texts. A copy of the
license is included in the section entitled
"GNU Free Documentation License". Edited
by Bci2, with all contributors listed after the
License statement.






DNA Basics 



DNA 



Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used
in the development and functioning of all known living organisms 
and some viruses. The main role of DNA molecules is the 
long-term storage of information. DNA is often compared to a set
of blueprints, a recipe, or a code, since it contains the instructions
needed to construct other components of cells, such as proteins and 
RNA molecules. The DNA segments that carry this genetic 
information are called genes, but other DNA sequences have 
structural purposes, or are involved in regulating the use of this 
genetic information. 

Chemically, DNA consists of two long polymers of simple units 
called nucleotides, with backbones made of sugars and phosphate 
groups joined by ester bonds. These two strands run in opposite 
directions to each other and are therefore anti-parallel. Attached to 
each sugar is one of four types of molecules called bases. It is the 
sequence of these four bases along the backbone that encodes 
information. This information is read using the genetic code, which 
specifies the sequence of the amino acids within proteins. The code 
is read by copying stretches of DNA into the related nucleic acid RNA, in a process called transcription.

The structure of part of a DNA double helix

Within cells, DNA is organized into long structures called chromosomes. These chromosomes are duplicated before 
cells divide, in a process called DNA replication. Eukaryotic organisms (animals, plants, fungi, and protists) store 
most of their DNA inside the cell nucleus and some of their DNA in organelles, such as mitochondria or 
chloroplasts. In contrast, prokaryotes (bacteria and archaea) store their DNA only in the cytoplasm. Within the
chromosomes, chromatin proteins such as histones compact and organize DNA. These compact structures guide the 
interactions between DNA and other proteins, helping control which parts of the DNA are transcribed. 
Properties



DNA is a long polymer made from repeating units called nucleotides.[4] The DNA chain is 22 to 26 Ångströms wide
(2.2 to 2.6 nanometres), and one nucleotide unit is 3.3 Å (0.33 nm) long.[5] Although each individual repeating unit
is very small, DNA polymers can be very large molecules containing millions of nucleotides. For instance, the largest
human chromosome, chromosome number 1, is approximately 220 million base pairs long.

In living organisms, DNA does not usually exist as a single molecule, but instead as a pair of molecules that are
held tightly together.[7][8] These two long strands entwine like vines, in the shape of a double helix. The nucleotide
repeats contain both the segment of the backbone of the molecule, which holds the chain together, and a base, which
interacts with the other DNA strand in the helix. A base linked to a sugar is called a nucleoside and a base linked to
a sugar and one or more phosphate groups is called a nucleotide. If multiple nucleotides are linked together, as in
DNA, this polymer is called a polynucleotide.[9]
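As a rough worked example of the dimensions quoted above, the following sketch multiplies a nucleotide count by the 0.33 nm length of one nucleotide unit. It assumes that the figure of about 220 million for chromosome 1 counts units along a single duplex and that the molecule is fully stretched out; the function name is illustrative only.

```python
# Rough contour length of a DNA molecule from its nucleotide count.
# Assumption: each nucleotide unit adds 0.33 nm along the chain (figure from the text).
NM_PER_NUCLEOTIDE = 0.33


def contour_length_mm(nucleotides):
    """Return the stretched-out length of the chain in millimetres."""
    length_nm = nucleotides * NM_PER_NUCLEOTIDE
    return length_nm / 1e6  # 1 mm = 1,000,000 nm


if __name__ == "__main__":
    # Chromosome 1, taken here as about 220 million units long.
    print(f"{contour_length_mm(220_000_000):.1f} mm")
```

Under these assumptions the largest human chromosome would be roughly 7 cm long if laid out straight, even though it is only about 2 nm wide.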

The backbone of the DNA strand is made from alternating phosphate and sugar residues.[10] The sugar in DNA is
2-deoxyribose, which is a pentose (five-carbon) sugar. The sugars are joined together by phosphate groups that form
phosphodiester bonds between the third and fifth carbon atoms of adjacent sugar rings. These asymmetric bonds
mean a strand of DNA has a direction. In a double helix the direction of the nucleotides in one strand is opposite to
their direction in the other strand: the strands are antiparallel. The asymmetric ends of DNA strands are called the 5'
(five prime) and 3' (three prime) ends, with the 5' end having a terminal phosphate group and the 3' end a terminal
hydroxyl group. One major difference between DNA and RNA is the sugar, with the 2-deoxyribose in DNA being
replaced by the alternative pentose sugar ribose in RNA.




Chemical structure of DNA. Hydrogen bonds shown as dotted lines. 

The DNA double helix is stabilized by hydrogen bonds between the bases attached to the two strands. The four bases
found in DNA are adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T). These four bases are attached to
the sugar/phosphate to form the complete nucleotide, as shown for adenosine monophosphate.

These bases are classified into two types: adenine and guanine are fused five- and six-membered heterocyclic compounds
called purines, while cytosine and thymine are six-membered rings called pyrimidines. A fifth pyrimidine base, called
uracil (U), usually takes the place of thymine in RNA and differs from thymine by lacking a methyl group on its ring.
Uracil is not usually found in DNA, occurring only as a breakdown product of cytosine. In addition to RNA and DNA, a
large number of artificial nucleic acid analogues have also been created to study the properties of nucleic acids, or
for use in biotechnology.[12]

Grooves 

Twin helical strands form the DNA backbone. Another double 
helix may be found by tracing the spaces, or grooves, between the 
strands. These voids are adjacent to the base pairs and may provide 
a binding site. As the strands are not directly opposite each other, 
the grooves are unequally sized. One groove, the major groove, is 22 Å wide and the other, the minor groove, is 12 Å
wide.[13] The narrowness of the minor groove means that the edges of the bases are more accessible in the major
groove. As a result, proteins like transcription factors that can bind to specific sequences in double-stranded DNA
usually make contacts to the sides of the bases exposed in the major groove.[14] This situation varies in unusual
conformations of DNA within the cell (see below), but the major and minor grooves are always named to reflect the
differences in size that would be seen if the DNA is twisted back into the ordinary B form.




A section of DNA. The bases lie horizontally between
the two spiraling strands.



Base pairing 

Each type of base on one strand forms a bond with just one type of base on the other strand. This is called
complementary base pairing. Here, purines form hydrogen bonds to pyrimidines, with A bonding only to T, and C
bonding only to G. This arrangement of two nucleotides binding together across the double helix is called a base
pair. As hydrogen bonds are not covalent, they can be broken and rejoined relatively easily. The two strands of DNA
in a double helix can therefore be pulled apart like a zipper, either by a mechanical force or high temperature. As
a result of this complementarity, all the information in the double-stranded sequence of a DNA helix is duplicated on
each strand, which is vital in DNA replication. Indeed, this reversible and specific interaction between
complementary base pairs is critical for all the functions of DNA in living organisms.
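
The pairing rules above can be sketched in a few lines of code. This is a minimal illustration, assuming sequences are written as plain strings of A, C, G and T in the 5' to 3' direction; the function names are not taken from any particular library.

```python
# Watson-Crick pairing rules from the text: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Base-by-base partner of a strand, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand):
    """Partner strand written 5' to 3', since the two strands are antiparallel."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    top = "GATTACA"
    print(complement(top))          # CTAATGT
    print(reverse_complement(top))  # TGTAATC
```

Because each base has exactly one partner, applying `reverse_complement` twice returns the original strand, which is the sense in which the information is duplicated on each strand.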

Top, a GC base pair with three hydrogen bonds. Bottom, an AT base pair with two hydrogen bonds. Non-covalent
hydrogen bonds between the pairs are shown as dashed lines.

The two types of base pairs form different numbers of hydrogen bonds, AT forming two hydrogen bonds, and GC forming
three hydrogen bonds (see figures). DNA with high GC-content is more stable than DNA with low GC-content, but contrary
to popular belief, this is not due to the extra hydrogen bond of a GC base pair but rather the contribution of
stacking interactions (hydrogen bonding merely provides specificity of the pairing, not stability). As a result, it is
both the percentage of GC base pairs and the overall length of a DNA double helix that determine the strength of the
association between the two strands of DNA. Long DNA helices with a high GC content have stronger-interacting strands,
while short helices with high AT content have weaker-interacting strands.[17] In biology, parts of the DNA double
helix that need to separate easily, such as the TATAAT Pribnow box in some promoters, tend to have a high AT content,
making the strands easier to pull apart.[18] In the laboratory, the strength of this interaction can be measured by
finding the temperature required to break the hydrogen bonds, their melting temperature (also called Tm value). When
all the base pairs in a DNA double helix melt, the strands separate and exist in solution as two entirely independent
molecules. These single-stranded DNA molecules have no single common shape, but some conformations are more stable
than others.[19]
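
GC content, one of the two quantities named above, is simple to compute from a sequence. The sketch below assumes a plain string of bases and returns a fraction between 0 and 1; it does not attempt to predict a melting temperature, which also depends on length and on solution conditions.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    print(gc_content("TATAAT"))    # the Pribnow box contains no G or C
    print(gc_content("GGCGCCAT"))  # a hypothetical GC-rich stretch
```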
Sense and antisense 

A DNA sequence is called "sense" if its sequence is the same as that of a messenger RNA copy that is translated into 
protein.[20] The sequence on the opposite strand is called the "antisense" sequence. Both sense and antisense
sequences can exist on different parts of the same strand of DNA (i.e. both strands contain both sense and antisense 
sequences). In both prokaryotes and eukaryotes, antisense RNA sequences are produced, but the functions of these 
RNAs are not entirely clear. One proposal is that antisense RNAs are involved in regulating gene expression 
through RNA-RNA base pairing. [22] 
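
As an illustration of these definitions, the sketch below derives the messenger RNA and the antisense sequence from a sense sequence. It assumes plain strings written 5' to 3' and relies on the substitution of uracil for thymine in RNA described earlier; the function names and the example sequence are hypothetical.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def transcribe_sense(sense_dna):
    """mRNA with the same sequence as the sense strand, with U in place of T."""
    return sense_dna.upper().replace("T", "U")


def antisense(sense_dna):
    """Sequence on the opposite strand, written 5' to 3'."""
    return "".join(PAIRS[base] for base in reversed(sense_dna.upper()))


if __name__ == "__main__":
    sense = "ATGGCC"
    print(transcribe_sense(sense))  # AUGGCC
    print(antisense(sense))         # GGCCAT
```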

A few DNA sequences in prokaryotes and eukaryotes, and more in plasmids and viruses, blur the distinction between 
sense and antisense strands by having overlapping genes. In these cases, some DNA sequences do double duty, 
encoding one protein when read along one strand, and a second protein when read in the opposite direction along the 
other strand. In bacteria, this overlap may be involved in the regulation of gene transcription, while in viruses,
overlapping genes increase the amount of information that can be encoded within the small viral genome. 

Supercoiling 

DNA can be twisted like a rope in a process called DNA supercoiling. With DNA in its "relaxed" state, a strand usually
circles the axis of the double helix once every 10.4 base pairs, but if the DNA is twisted the strands become more
tightly or more loosely wound. If the DNA is twisted in the direction of the helix, this is positive supercoiling, and
the bases are held more tightly together. If it is twisted in the opposite direction, this is negative supercoiling,
and the bases come apart more easily. In nature, most DNA has slight negative supercoiling that is introduced by
enzymes called topoisomerases.[27] These enzymes are also needed to relieve the twisting stresses introduced into DNA
strands during processes such as transcription and DNA replication.[28]
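
The figure of 10.4 base pairs per turn allows a rough calculation of how over- or underwound a molecule is. The sketch below is an assumption-laden illustration: it uses the linking number of a closed circular molecule and the conventional superhelical density, (Lk - Lk0) / Lk0, neither of which is defined in the text above, and the 5,200 base pair plasmid is hypothetical.

```python
BP_PER_TURN = 10.4  # relaxed DNA, figure from the text


def relaxed_linking_number(base_pairs):
    """Number of helical turns expected in relaxed DNA of this length."""
    return base_pairs / BP_PER_TURN


def superhelical_density(base_pairs, linking_number):
    """(Lk - Lk0) / Lk0: negative for underwound DNA, positive for overwound DNA."""
    lk0 = relaxed_linking_number(base_pairs)
    return (linking_number - lk0) / lk0


if __name__ == "__main__":
    # Hypothetical 5,200 bp circular plasmid that has lost 30 turns.
    print(relaxed_linking_number(5200))
    print(superhelical_density(5200, 470))
```

A negative result corresponds to the negative supercoiling described above, in which the bases come apart more easily.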



Alternate DNA structures 

DNA exists in many possible conformations that include A-DNA, B-DNA, and Z-DNA forms, although only B-DNA and Z-DNA
have been directly observed in functional organisms. The conformation that DNA adopts depends on the hydration level,
DNA sequence, the amount and direction of supercoiling, chemical modifications of the bases, the type and
concentration of metal ions, as well as the presence of polyamines in solution.[29]




From left to right, the structures of A, B and Z DNA 



The first published reports of A-DNA X-ray diffraction patterns, and also those of B-DNA, used analyses based on
Patterson transforms that provided only a limited amount of structural information for oriented fibers of
DNA.[30][31] An alternate analysis was then proposed by Wilkins et al. in 1953 for the in vivo B-DNA X-ray
diffraction/scattering patterns of highly hydrated DNA fibers in terms of squares of Bessel functions.[32] In the same
journal, Watson and Crick presented their molecular modeling analysis of the DNA X-ray diffraction patterns to suggest
that the structure was a double helix.[7]

Although the "B-DNA form" is most common under the conditions found in cells,[33] it is not a well-defined
conformation but a family of related DNA conformations[34] that occur at the high hydration levels present in living
cells. Their corresponding X-ray diffraction and scattering patterns are characteristic of molecular paracrystals with
a significant degree of disorder.[35][36]

Compared to B-DNA, the A-DNA form is a wider right-handed spiral, with a shallow, wide minor groove and a
narrower, deeper major groove. The A form occurs under non-physiological conditions in partially dehydrated
samples of DNA, while in the cell it may be produced in hybrid pairings of DNA and RNA strands, as well as in
enzyme-DNA complexes. Segments of DNA where the bases have been chemically modified by methylation may undergo a
larger change in conformation and adopt the Z form. Here, the strands turn about the helical axis in a left-handed
spiral, the opposite of the more common B form.[39] These unusual structures can be recognized by specific Z-DNA
binding proteins and may be involved in the regulation of transcription.[40]






Quadruplex structures 

At the ends of the linear chromosomes are specialized regions of DNA called telomeres. The main function of these
regions is to allow the cell to replicate chromosome ends using the enzyme telomerase, as the enzymes that normally
replicate DNA cannot copy the extreme 3' ends of chromosomes.[42] These specialized chromosome caps also help protect
the DNA ends, and stop the DNA repair systems in the cell from treating them as damage to be corrected.[43] In human
cells, telomeres are usually lengths of single-stranded DNA containing several thousand repeats of a simple TTAGGG
sequence.[44]

These guanine-rich sequences may stabilize 
chromosome ends by forming structures of stacked 
sets of four-base units, rather than the usual base 
pairs found in other DNA molecules. Here, four guanine bases form a flat plate and these flat four-base units then 
stack on top of each other, to form a stable G-quadruplex structure.[45] These structures are stabilized by hydrogen
bonding between the edges of the bases and chelation of a metal ion in the centre of each four-base unit.[46] Other
structures can also be formed, with the central set of four bases coming from either a single strand folded around the 
bases, or several different parallel strands, each contributing one base to the central structure. 

In addition to these stacked structures, telomeres also form large loop structures called telomere loops, or T-loops.
Here, the single-stranded DNA curls around in a long circle stabilized by telomere-binding proteins.[47] At the very
end of the T-loop, the single-stranded telomere DNA is held onto a region of double-stranded DNA by the telomere
strand disrupting the double-helical DNA and base pairing to one of the two strands. This triple-stranded structure is
called a displacement loop or D-loop.[45]



Branched DNA can form networks containing multiple branches. Panels: single branch; multiple branches.




Branched DNA 

In DNA, fraying occurs when non-complementary regions exist at the end of an otherwise complementary
double strand of DNA. However, branched DNA can occur if a third strand of DNA is introduced and contains
adjoining regions able to hybridize with the frayed regions of the pre-existing double strand. Although the simplest
example of branched DNA involves only three strands of DNA, complexes involving additional strands and multiple
branches are also possible.[48] Branched DNA can be used in nanotechnology to construct geometric shapes; see the
section on uses in technology below.

Chemical modifications




Structure of cytosine with and without the 5-methyl group. Deamination converts 5-methylcytosine into thymine. 



Base modifications 

The expression of genes is influenced by how the DNA is packaged in chromosomes, in a structure called chromatin.
Base modifications can be involved in packaging, with regions that have low or no gene expression usually
containing high levels of methylation of cytosine bases. For example, cytosine methylation produces
5-methylcytosine, which is important for X-chromosome inactivation.[49] The average level of methylation varies
between organisms: the worm Caenorhabditis elegans lacks cytosine methylation, while vertebrates have higher
levels, with up to 1% of their DNA containing 5-methylcytosine. Despite the importance of 5-methylcytosine, it
can deaminate to leave a thymine base; methylated cytosines are therefore particularly prone to mutations. Other
base modifications include adenine methylation in bacteria, the presence of 5-hydroxymethylcytosine in the brain,
and the glycosylation of uracil to produce the "J-base" in kinetoplastids.[53][54]



Damage 

DNA can be damaged by many sorts of mutagens, which change the DNA sequence. Mutagens include oxidizing agents,
alkylating agents and also high-energy electromagnetic radiation such as ultraviolet light and X-rays. The type of DNA
damage produced depends on the type of mutagen. For example, UV light can damage DNA by producing thymine dimers,
which are cross-links between pyrimidine bases.[56] On the other hand, oxidants such as free radicals or hydrogen
peroxide produce multiple forms of damage, including base modifications, particularly of guanosine, and double-strand
breaks. A typical human cell contains about 150,000 bases that have suffered oxidative damage.[58] Of these oxidative
lesions, the most dangerous are double-strand breaks, as these are difficult to repair and can produce point
mutations, insertions and deletions from the DNA sequence, as well as chromosomal translocations.[59]

Many mutagens fit into the space between two adjacent base pairs; this is called intercalation. Most intercalators
are aromatic and planar molecules, and include ethidium bromide, daunomycin, and doxorubicin. In order for an
intercalator to fit between base pairs, the bases must separate, distorting the DNA strands by unwinding of the double
helix. This inhibits both transcription and DNA replication, causing toxicity and mutations. As a result, DNA
intercalators are often carcinogens, and benzo[a]pyrene diol epoxide, acridines, aflatoxin and ethidium bromide are
well-known examples.[60][61][62] Nevertheless, due to their ability to inhibit DNA transcription and replication,
other similar toxins are also used in chemotherapy to inhibit rapidly growing cancer cells.

A covalent adduct between benzo[a]pyrene, the major mutagen in tobacco smoke, and DNA



Biological functions 

DNA usually occurs as linear chromosomes in eukaryotes, and circular chromosomes in prokaryotes. The set of 
chromosomes in a cell makes up its genome; the human genome has approximately 3 billion base pairs of DNA 
arranged into 46 chromosomes.[64] The information carried by DNA is held in the sequence of pieces of DNA called
genes. Transmission of genetic information in genes is achieved via complementary base pairing. For example, in 
transcription, when a cell uses the information in a gene, the DNA sequence is copied into a complementary RNA 
sequence through the attraction between the DNA and the correct RNA nucleotides. Usually, this RNA copy is then 
used to make a matching protein sequence in a process called translation which depends on the same interaction 
between RNA nucleotides. Alternatively, a cell may simply copy its genetic information in a process called DNA 
replication. The details of these functions are covered in other articles; here we focus on the interactions between 
DNA and other molecules that mediate the function of the genome. 

Genes and genomes 

Genomic DNA is located in the cell nucleus of eukaryotes, as well as small amounts in mitochondria and 
chloroplasts. In prokaryotes, the DNA is held within an irregularly shaped body in the cytoplasm called the 
nucleoid.[65] The genetic information in a genome is held within genes, and the complete set of this information in an
organism is called its genotype. A gene is a unit of heredity and is a region of DNA that influences a particular 
characteristic in an organism. Genes contain an open reading frame that can be transcribed, as well as regulatory 
sequences such as promoters and enhancers, which control the transcription of the open reading frame. 

In many species, only a small fraction of the total sequence of the genome encodes protein. For example, only about 
1.5% of the human genome consists of protein-coding exons, with over 50% of human DNA consisting of 
non-coding repetitive sequences.[66] The reasons for the presence of so much non-coding DNA in eukaryotic
genomes and the extraordinary differences in genome size, or C-value, among species represent a long-standing
puzzle known as the "C-value enigma".[67] However, DNA sequences that do not code protein may still encode
functional non-coding RNA molecules, which are involved in the regulation of gene expression.[68]
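
To give a sense of scale, the percentages above can be converted into base-pair counts. This is a back-of-the-envelope sketch that assumes the approximate genome size of 3 billion base pairs quoted earlier; since repetitive sequences make up "over 50%" of the genome, the second figure is a lower bound.

```python
GENOME_BP = 3_000_000_000  # approximate size of the human genome, from the text


def bases_in_fraction(fraction, genome_bp=GENOME_BP):
    """Number of base pairs covered by a given fraction of the genome."""
    return fraction * genome_bp


if __name__ == "__main__":
    print(f"{bases_in_fraction(0.015):,.0f} bp of protein-coding exons")
    print(f"at least {bases_in_fraction(0.50):,.0f} bp of repetitive sequence")
```

Under these assumptions, only about 45 million of the roughly 3 billion base pairs lie in protein-coding exons.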

Some non-coding DNA sequences play structural roles in chromosomes. Telomeres and centromeres typically contain few
genes, but are important for the function and stability of chromosomes.[43] Pseudogenes, which are copies of genes
that have been disabled by mutation, are an abundant form of non-coding DNA in humans. These sequences are usually
just molecular fossils, although they can occasionally serve as raw genetic material for the creation of new genes
through the process of gene duplication and divergence.[72]



Transcription and translation 

A gene is a sequence of DNA that contains genetic information and can influence the phenotype of an organism. 
Within a gene, the sequence of bases along a DNA strand defines a messenger RNA sequence, which then defines 
one or more protein sequences. The relationship between the nucleotide sequences of genes and the amino-acid 
sequences of proteins is determined by the rules of translation, known collectively as the genetic code. The genetic 
code consists of three-letter 'words' called codons formed from a sequence of three nucleotides (e.g. ACT, CAG, 
TTT). 

In transcription, the codons of a gene are copied into messenger RNA by RNA polymerase. This RNA copy is then 
decoded by a ribosome that reads the RNA sequence by base-pairing the messenger RNA to transfer RNA, which 
carries amino acids. Since there are 4 bases in 3-letter combinations, there are 64 possible codons (4³
combinations). These encode the twenty standard amino acids, giving most amino acids more than one possible
codon. There are also three 'stop' or 'nonsense' codons signifying the end of the coding region; these are the TAA, 
TGA and TAG codons. 
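
The counting argument above can be checked directly. The sketch below enumerates all 64 codons and maps them to amino acids using the standard genetic code, written as a 64-letter string in T, C, A, G order with "*" marking a stop codon. Codons are given in the DNA alphabet, as in the text; the `translate` helper is an illustrative simplification that starts at the first base and ignores start codons and reading-frame selection.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def translate(dna):
    """Translate a coding DNA sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))        # 64
    print(sorted(STOP_CODONS))     # ['TAA', 'TAG', 'TGA']
    print(translate("ACTCAGTTT"))  # TQF
```

The example codons from the text (ACT, CAG, TTT) translate to threonine, glutamine and phenylalanine, written T, Q and F in one-letter code.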




Replication 

DNA replication. The double helix is unwound by a helicase and topoisomerase. Next, one DNA polymerase produces the
leading strand copy. Another DNA polymerase binds to the lagging strand. This enzyme makes discontinuous segments
(called Okazaki fragments) before DNA ligase joins them together. Figure labels: topoisomerase; single-strand binding
proteins.

Cell division is essential for an organism to grow, but when a cell divides it must replicate the DNA in its genome
so that the two daughter cells have the same genetic information as their parent. The double-stranded structure of DNA
provides a simple mechanism for DNA replication. Here, the two strands are separated and then each strand's
complementary DNA sequence is recreated by an enzyme called DNA polymerase. This enzyme makes the complementary strand
by finding the correct base through complementary base pairing, and bonding it onto the original strand. As DNA
polymerases can only extend a DNA strand in a 5' to 3' direction, different mechanisms are used to copy the
antiparallel strands of the double helix. In this way, the base on the old strand dictates which base appears on the
new strand, and the cell ends up with a perfect copy of its DNA.
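
The copying logic described above can be sketched as follows. This is a deliberately simplified model, assuming each strand is a string written 5' to 3' and ignoring leading and lagging strand synthesis, primers and Okazaki fragments; the function names are illustrative.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def synthesize(template):
    """New strand paired to a template, written 5' to 3' (antiparallel to it)."""
    return "".join(PAIRS[base] for base in reversed(template.upper()))


def replicate(duplex):
    """Separate the two strands and let each one template a new partner."""
    top, bottom = duplex
    return [(top, synthesize(top)), (synthesize(bottom), bottom)]


if __name__ == "__main__":
    parent = ("ATGC", synthesize("ATGC"))
    for daughter in replicate(parent):
        print(daughter)  # both daughters match the parent duplex
```

Each daughter duplex keeps one old strand and gains one new strand, and both carry the same sequence as the parent.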



Interactions with proteins 

All the functions of DNA depend on interactions with proteins. These protein interactions can be non-specific, or the 
protein can bind specifically to a single DNA sequence. Enzymes can also bind to DNA and of these, the 
polymerases that copy the DNA base sequence in transcription and DNA replication are particularly important. 



DNA-binding proteins 




Interaction of DNA with histones (shown in white, top). These proteins' basic amino acids (below left, blue) bind to 
the acidic phosphate groups on DNA (below right, red). 

Structural proteins that bind DNA are well-understood examples of non-specific DNA-protein interactions. Within
chromosomes, DNA is held in complexes with structural proteins. These proteins organize the DNA into a compact
structure called chromatin. In eukaryotes this structure involves DNA binding to a complex of small basic proteins
called histones, while in prokaryotes multiple types of proteins are involved. The histones form a disk-shaped
complex called a nucleosome, which contains two complete turns of double-stranded DNA wrapped around its
surface. These non-specific interactions are formed through basic residues in the histones making ionic bonds to the
acidic sugar-phosphate backbone of the DNA, and are therefore largely independent of the base sequence.[76]
Chemical modifications of these basic amino acid residues include methylation, phosphorylation and acetylation.[77]
These chemical changes alter the strength of the interaction between the DNA and the histones, making the DNA
more or less accessible to transcription factors and changing the rate of transcription.[78] Other non-specific
DNA-binding proteins in chromatin include the high-mobility group proteins, which bind to bent or distorted
DNA.[79] These proteins are important in bending arrays of nucleosomes and arranging them into the larger
structures that make up chromosomes.[80]

A distinct group of DNA-binding proteins are those that specifically bind single-stranded DNA.
In humans, replication protein A is the best-understood member of this family and is used in processes where the
double helix is separated, including DNA replication, recombination and DNA repair.[81] These binding proteins
seem to stabilize single-stranded DNA and protect it from forming stem-loops or being degraded by nucleases.

In contrast, other proteins have evolved to bind to particular DNA sequences. The most intensively studied of these
are the various transcription factors, which are proteins that regulate transcription. Each transcription factor binds
to one particular set of DNA sequences and activates or inhibits the transcription of genes that have these sequences
close to their promoters. The transcription factors do this in two ways. Firstly, they can bind the RNA polymerase
responsible for transcription, either directly or through other mediator proteins; this locates the polymerase at the
promoter and allows it to begin transcription.[83] Alternatively, transcription factors can bind enzymes that modify
the histones at the promoter; this will change the accessibility of the DNA template to the polymerase.[84]

As these DNA targets can occur throughout an organism's genome, changes in the activity of one type of transcription
factor can affect thousands of genes.[85] Consequently, these proteins are often the targets of the signal
transduction processes that control responses to environmental changes or cellular differentiation and development.
The specificity of these transcription factors' interactions with DNA comes from the proteins making multiple contacts
to the edges of the DNA bases, allowing them to "read" the DNA sequence. Most of these base-interactions are made in
the major groove, where the bases are most accessible.[86]

The lambda repressor helix-turn-helix transcription factor bound to its DNA target.[82]




The restriction enzyme EcoRV (green) in a complex with its substrate DNA.[87]



DNA-modifying enzymes 
Nucleases and ligases 

Nucleases are enzymes that cut DNA strands by catalyzing the hydrolysis of the phosphodiester bonds. Nucleases that
hydrolyse nucleotides from the ends of DNA strands are called exonucleases, while endonucleases cut within strands.
The most frequently used nucleases in molecular biology are the restriction endonucleases, which cut DNA at specific
sequences. For instance, the EcoRV enzyme shown in the figure recognizes the 6-base sequence 5'-GAT|ATC-3' and makes a
cut at the vertical line. In nature, these enzymes protect bacteria against phage infection by digesting the phage DNA
when it enters the bacterial cell, acting as part of the restriction modification system.[88] In technology, these
sequence-specific nucleases are used in molecular cloning and DNA fingerprinting.
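
The way a restriction endonuclease cuts at a specific sequence can be sketched as a simple string operation. The example below assumes the EcoRV site GATATC with the cut between GAT and ATC, as described above, and models only one strand written 5' to 3'; the function name and test sequence are hypothetical.

```python
ECORV_SITE = "GATATC"
CUT_OFFSET = 3  # cut between GAT and ATC, as in 5'-GAT|ATC-3'


def digest(seq, site=ECORV_SITE, offset=CUT_OFFSET):
    """Split one strand at every occurrence of the recognition site."""
    seq = seq.upper()
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


if __name__ == "__main__":
    print(digest("AAGATATCGG"))  # ['AAGAT', 'ATCGG']
```

Because GATATC reads the same on both strands when each is written 5' to 3', a cut in the middle of the site falls at the same position on both strands.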



Enzymes called DNA ligases can rejoin cut or broken DNA strands.[89] Ligases are particularly important in lagging
strand DNA replication, as they join together the short segments of DNA produced at the replication fork into a
complete copy of the DNA template. They are also used in DNA repair and genetic recombination.[89]

Topoisomerases and helicases

Topoisomerases are enzymes with both nuclease and ligase activity. These proteins change the amount of supercoiling
in DNA. Some of these enzymes work by cutting the DNA helix and allowing one section to rotate, thereby reducing its
level of supercoiling; the enzyme then seals the DNA break.[27] Other types of these enzymes are capable of cutting
one DNA helix and then passing a second strand of DNA through this break, before rejoining the helix.[90]
Topoisomerases are required for many processes involving DNA, such as DNA replication and transcription.

Helicases are proteins that are a type of molecular motor. They use the chemical energy in nucleoside triphosphates, 
predominantly ATP, to break hydrogen bonds between bases and unwind the DNA double helix into single 
strands. These enzymes are essential for most processes where enzymes need to access the DNA bases. 

Polymerases 

Polymerases are enzymes that synthesize polynucleotide chains from nucleoside triphosphates. The sequences of their
products are copies of existing polynucleotide chains, which are called templates. These enzymes function by
adding nucleotides onto the 3' hydroxyl group of the previous nucleotide in a DNA strand. Consequently, all
polymerases work in a 5' to 3' direction. In the active site of these enzymes, the incoming nucleoside triphosphate
base-pairs to the template: this allows polymerases to accurately synthesize the complementary strand of their
template. Polymerases are classified according to the type of template that they use.

In DNA replication, a DNA-dependent DNA polymerase makes a copy of a DNA sequence. Accuracy is vital in this
process, so many of these polymerases have a proofreading activity. Here, the polymerase recognizes the occasional
mistakes in the synthesis reaction by the lack of base pairing between the mismatched nucleotides. If a mismatch is
detected, a 3' to 5' exonuclease activity is activated and the incorrect base removed. In most organisms DNA
polymerases function in a large complex called the replisome that contains multiple accessory subunits, such as the
DNA clamp or helicases.

RNA-dependent DNA polymerases are a specialized class of polymerases that copy the sequence of an RNA strand
into DNA. They include reverse transcriptase, which is a viral enzyme involved in the infection of cells by
retroviruses, and telomerase, which is required for the replication of telomeres. Telomerase is an unusual
polymerase because it contains its own RNA template as part of its structure.

Transcription is carried out by a DNA-dependent RNA polymerase that copies the sequence of a DNA strand into 
RNA. To begin transcribing a gene, the RNA polymerase binds to a sequence of DNA called a promoter and 
separates the DNA strands. It then copies the gene sequence into a messenger RNA transcript until it reaches a 
region of DNA called the terminator, where it halts and detaches from the DNA. As with human DNA-dependent 
DNA polymerases, RNA polymerase II, the enzyme that transcribes most of the genes in the human genome, 
operates as part of a large protein complex with multiple regulatory and accessory subunits.

Genetic recombination

Structure of the Holliday junction intermediate in genetic recombination. The four separate DNA strands are 
coloured red, blue, green and yellow. 

A DNA helix usually does not interact with other 
segments of DNA, and in human cells the different 
chromosomes even occupy separate areas in the 
nucleus called "chromosome territories". This
physical separation of different chromosomes is 
important for the ability of DNA to function as a stable 
repository for information, as one of the few times 
chromosomes interact is during chromosomal crossover 
when they recombine. Chromosomal crossover is when 
two DNA helices break, swap a section and then rejoin. 

Recombination allows chromosomes to exchange 
genetic information and produces new combinations of 
genes, which increases the efficiency of natural 
selection and can be important in the rapid evolution of
new proteins. Genetic recombination can also be involved in DNA repair, particularly in the cell's response to
double-strand breaks. [100]

The most common form of chromosomal crossover is homologous recombination, where the two chromosomes
involved share very similar sequences. Non-homologous recombination can be damaging to cells, as it can produce
chromosomal translocations and genetic abnormalities. The recombination reaction is catalyzed by enzymes known
as recombinases, such as RAD51. [101] The first step in recombination is a double-stranded break either caused by an
endonuclease or damage to the DNA. [102] A series of steps catalyzed in part by the recombinase then leads to joining
of the two helices by at least one Holliday junction, in which a segment of a single strand in each helix is annealed to
the complementary strand in the other helix. The Holliday junction is a tetrahedral junction structure that can be
moved along the pair of chromosomes, swapping one strand for another. The recombination reaction is then halted
by cleavage of the junction and re-ligation of the released DNA. [103]

Recombination involves the breakage and rejoining of two
chromosomes (M and F) to produce two re-arranged chromosomes
(C1 and C2).
Evolution

DNA contains the genetic information that allows all modern living things to function, grow and reproduce.
However, it is unclear how long in the 4-billion-year history of life DNA has performed this function, as it has been
proposed that the earliest forms of life may have used RNA as their genetic material. [104] RNA may have acted
as the central part of early cell metabolism as it can both transmit genetic information and carry out catalysis as part
of ribozymes. [105] This ancient RNA world where nucleic acid would have been used for both catalysis and genetics
may have influenced the evolution of the current genetic code based on four nucleotide bases. This would occur
since the number of different bases in such an organism is a trade-off between a small number of bases increasing
replication accuracy and a large number of bases increasing the catalytic efficiency of ribozymes. [106]

Unfortunately, there is no direct evidence of ancient genetic systems, as recovery of DNA from most fossils is
impossible. This is because DNA will survive in the environment for less than one million years and slowly degrades
into short fragments in solution. [107] Claims for older DNA have been made, most notably a report of the isolation of
a viable bacterium from a salt crystal 250 million years old, [108] but these claims are controversial. [109] [110]



Uses in technology 
Genetic engineering 

Methods have been developed to purify DNA from organisms, such as phenol-chloroform extraction, and to manipulate
it in the laboratory, such as restriction digests and the polymerase chain reaction. Modern biology and biochemistry
make intensive use of these techniques in recombinant DNA technology. Recombinant DNA is a man-made DNA
sequence that has been assembled from other DNA sequences. Such sequences can be transformed into organisms in
the form of plasmids or, in the appropriate format, by using a viral vector. [111] The genetically modified organisms
produced can be used to make products such as recombinant proteins, used in medical research, [112] or be grown in
agriculture. [113] [114]



Forensics 

Forensic scientists can use DNA in blood, semen, skin, saliva or hair found at a crime scene to identify matching
DNA from an individual, such as a perpetrator. This process is called genetic fingerprinting, or more accurately, DNA
profiling. In DNA profiling, the lengths of variable sections of repetitive DNA, such as short tandem repeats and
minisatellites, are compared between people. This method is usually an extremely reliable technique for identifying
matching DNA. [115] However, identification can be complicated if the scene is contaminated with DNA from several
people. [116] DNA profiling was developed in 1984 by British geneticist Sir Alec Jeffreys, [117] and first used in
forensic science to convict Colin Pitchfork in the 1988 Enderby murders case. [118]
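
The comparison of short tandem repeat lengths can be sketched as follows. This is only an illustration of the counting step: the motif `GATA`, the two sample sequences, and the function name `longest_repeat_run` are invented for the example, and real profiling measures fragment lengths at several defined loci rather than scanning raw sequence.

```python
import re


def longest_repeat_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif in sequence."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Hypothetical alleles that differ only in the number of GATA repeats.
sample_a = "TTGATAGATAGATAGATACC"
sample_b = "TTGATAGATAGATACC"
print(longest_repeat_run(sample_a, "GATA"), longest_repeat_run(sample_b, "GATA"))
```

Two samples with different repeat counts at the same locus would be reported as different alleles.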

People convicted of certain types of crimes may be required to provide a sample of DNA for a database. This has
helped investigators solve old cases where only a DNA sample was obtained from the scene. DNA profiling can also
be used to identify victims of mass casualty incidents. [119] On the other hand, many convicted people have been
released from prison on the basis of DNA techniques, which were not available when a crime had originally been
committed.



Bioinformatics 

Bioinformatics involves the manipulation, searching, and data mining of DNA sequence data. The development of
techniques to store and search DNA sequences has led to widely applied advances in computer science, especially
string searching algorithms, machine learning and database theory. [120] String searching or matching algorithms,
which find an occurrence of a sequence of letters inside a larger sequence of letters, were developed to search for
specific sequences of nucleotides. [121] In other applications such as text editors, even simple algorithms for this
problem usually suffice, but DNA sequences cause these algorithms to exhibit near-worst-case behaviour due to their
small number of distinct characters. The related problem of sequence alignment aims to identify homologous
sequences and locate the specific mutations that make them distinct. These techniques, especially multiple sequence
alignment, are used in studying phylogenetic relationships and protein function. [122] Data sets representing entire
genomes' worth of DNA sequences, such as those produced by the Human Genome Project, are difficult to use
without annotations, which label the locations of genes and regulatory elements on each chromosome. Regions of
DNA sequence that have the characteristic patterns associated with protein- or RNA-coding genes can be identified
by gene finding algorithms, which allow researchers to predict the presence of particular gene products in an
organism even before they have been isolated experimentally. [123]
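
Once two sequences have been aligned, locating the mutations that distinguish them is a position-by-position comparison. The sketch below assumes the alignment has already been done and that both sequences have equal length with no gaps; the function name `point_differences` and the example sequences are illustrative.

```python
def point_differences(seq_a, seq_b):
    """List (position, base_a, base_b) wherever two aligned sequences differ.

    Positions are 0-based. Insertions and deletions are not handled.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


print(point_differences("GATTACA", "GACTATA"))
```

A full alignment method would also have to decide where gaps belong, which is the harder part of the problem.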



DNA nanotechnology 



DNA nanotechnology uses the unique molecular recognition properties of DNA and other nucleic acids to create
self-assembling branched DNA complexes with useful properties. [124] DNA is thus used as a structural
material rather than as a carrier of biological information. This has led to the creation of two-dimensional
periodic lattices (both tile-based as well as using the "DNA origami" method) as well as three-dimensional
structures in the shapes of polyhedra. [125] Nanomechanical devices and algorithmic self-assembly
have also been demonstrated, [126] and these DNA structures have been used to template the arrangement of
other molecules such as gold nanoparticles and streptavidin proteins. [127]




The DNA structure at left (schematic shown) will self-assemble into the structure 
visualized by atomic force microscopy at right. DNA nanotechnology is the field which 
seeks to design nanoscale structures using the molecular recognition properties of DNA 
molecules. Image from Strong, 2004 (doi:10.1371/journal.pbio.0020073). 



History and anthropology 

Because DNA collects mutations over time, which are then inherited, it contains historical information and by
comparing DNA sequences, geneticists can infer the evolutionary history of organisms, their phylogeny. [128] This
field of phylogenetics is a powerful tool in evolutionary biology. If DNA sequences within a species are compared,
population geneticists can learn the history of particular populations. This can be used in studies ranging from
ecological genetics to anthropology; for example, DNA evidence is being used to try to identify the Ten Lost Tribes
of Israel. [129] [130]

DNA has also been used to look at modern family relationships, such as establishing family relationships between
the descendants of Sally Hemings and Thomas Jefferson. This usage is closely related to the use of DNA in criminal
investigations detailed above. Indeed, some criminal investigations have been solved when DNA from crime scenes
has matched relatives of the guilty individual. [131]

History of DNA research

DNA was first isolated by the Swiss physician Friedrich Miescher who, in 1869, discovered a microscopic substance
in the pus of discarded surgical bandages. [132] As it resided in the nuclei of cells, he called it "nuclein". In 1919,
Phoebus Levene identified the base, sugar and phosphate nucleotide unit. [133] Levene suggested that DNA consisted
of a string of nucleotide units linked together through the phosphate groups. However, Levene thought the chain was
short and the bases repeated in a fixed order. In 1937 William Astbury produced the first X-ray diffraction patterns
that showed that DNA had a regular structure. [134]

In 1928, Frederick Griffith discovered that traits of the "smooth" form of the Pneumococcus could be transferred to
the "rough" form of the same bacteria by mixing killed "smooth" bacteria with the live "rough" form. [135] This
system provided the first clear suggestion that DNA carried genetic information (the Avery-MacLeod-McCarty
experiment) when Oswald Avery, along with coworkers Colin MacLeod and Maclyn McCarty, identified DNA as
the transforming principle in 1943. [136] DNA's role in heredity was confirmed in 1952, when Alfred Hershey and
Martha Chase in the Hershey-Chase experiment showed that DNA is the genetic material of the T2 phage. [137]

Rosalind Franklin

Raymond Gosling



In 1953 James D. Watson and Francis Crick suggested what is now accepted as the first correct double-helix model
of DNA structure in the journal Nature. Their double-helix, molecular model of DNA was then based on a single
X-ray diffraction image (labeled as "Photo 51") taken by Rosalind Franklin and Raymond Gosling in May 1952 [138],
as well as the information that the DNA bases were paired, also obtained through private communications from
Erwin Chargaff in the previous years. Chargaff's rules played a very important role in establishing double-helix
configurations for B-DNA as well as A-DNA.

Experimental evidence supporting the Watson and Crick model was published in a series of five articles in the same
issue of Nature. [139] Of these, Franklin and Gosling's paper was the first publication of their own X-ray diffraction
data and original analysis method that partially supported the Watson and Crick model; [140] this issue also
contained an article on DNA structure by Maurice Wilkins and two of his colleagues, whose analysis and in vivo
B-DNA X-ray patterns also supported the presence in vivo of the double-helical DNA configurations as proposed by
Crick and Watson for their double-helix molecular model of DNA in the previous two pages of Nature. In 1962,
after Franklin's death, Watson, Crick, and Wilkins jointly received the Nobel Prize in Physiology or Medicine. [141]
Nobel rules of the time allowed only living recipients, and a vigorous debate continues on who should
receive credit for the discovery. [142]

In an influential presentation in 1957, Crick laid out the "Central Dogma" of molecular biology, which foretold the
relationship between DNA, RNA, and proteins, and articulated the "adaptor hypothesis". [143] Final confirmation of
the replication mechanism that was implied by the double-helical structure followed in 1958 through the
Meselson-Stahl experiment. [144] Further work by Crick and coworkers showed that the genetic code was based on
non-overlapping triplets of bases, called codons, allowing Har Gobind Khorana, Robert W. Holley and Marshall
Warren Nirenberg to decipher the genetic code. [145] These findings represent the birth of molecular biology.



See also 

• Crystallography 

• DNA microarray 

• DNA sequencing 

• Genetic disorder 

• Junk DNA 

• Molecular models of DNA 

• Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid 

• Nucleic acid analogues 

• Nucleic acid methods 

• Nucleic acid modeling 

• Nucleic Acid Notations 

• Paracrystal model and theory 

• X-ray crystallography 

• X-ray scattering 

• Phosphoramidite 

• Plasmid 

• Polymerase chain reaction 

• Proteopedia DNA [146]

• Southern blot 

• Triple-stranded DNA 



Further reading 

• Calladine, Chris R.; Drew, Horace R.; Luisi, Ben F. and Travers, Andrew A. (2003). Understanding DNA: the 
molecule & how it works. Amsterdam: Elsevier Academic Press. ISBN 0-12-155089-3. 

• Dennis, Carina; Julie Clayton (2003). 50 years of DNA. Basingstoke: Palgrave Macmillan. ISBN 1-4039-1479-6. 

• Judson, Horace Freeland (1996). The eighth day of creation: makers of the revolution in biology. Plainview, N.Y:
CSHL Press. ISBN 0-87969-478-5.

• Olby, Robert C. (1994). The path to the double helix: the discovery of DNA. New York: Dover Publications.
ISBN 0-486-68117-3. First published in October 1974 by MacMillan, with a foreword by Francis Crick; the
definitive DNA textbook, revised in 1994 with a 9-page postscript.

• Olby, Robert C. (2009). Francis Crick: A Biography. Plainview, N.Y: Cold Spring Harbor Laboratory Press.
ISBN 0-87969-798-9.

• Ridley, Matt (2006). Francis Crick: discoverer of the genetic code. Ashland, OH: Eminent Lives, Atlas Books.
ISBN 0-06-082333-X.

• Berry, Andrew; Watson, James D. (2003). DNA: the secret of life. New York: Alfred A. Knopf. 
ISBN 0-375-41546-7. 

• Stent, Gunther Siegmund; Watson, James D. (1980). The double helix: a personal account of the discovery of the 
structure of DNA. New York: Norton. ISBN 0-393-95075-1. 



• Wilkins, Maurice (2003). The third man of the double helix: the autobiography of Maurice Wilkins. Cambridge,
Eng: University Press. ISBN 0-19-860665-6.



External links 

DNA [147] at the Open Directory Project
DNA binding site prediction on protein [148]
DNA coiling to form chromosomes [149]
DNA from the Beginning [150]: another DNA Learning Center site on DNA, genes, and heredity from Mendel to
the human genome project.
DNA Lab, demonstrates how to extract DNA from wheat using readily available equipment and supplies. [151]
DNA the Double Helix Game [152], from the official Nobel Prize web site
DNA under electron microscope [153]
Dolan DNA Learning Center [154]
Double Helix: 50 years of DNA [155], Nature
Double Helix 1953-2003 [156], National Centre for Biotechnology Education
Francis Crick and James Watson talking on the BBC in 1962, 1972, and 1974 [157]
Genetic Education Modules for Teachers [158]: DNA from the Beginning Study Guide
Guide to DNA cloning [159]
Olby R (January 2003). "Quiet debut for the double helix" [160]. Nature 421 (6921): 402-5.
doi:10.1038/nature01397. PMID 12540907.
Rosalind Franklin's contributions to the study of DNA [162]
The Register of Francis Crick Personal Papers 1938 - 2007 [163] at Mandeville Special Collections Library, Geisel
Library, University of California, San Diego
U.S. National DNA Day [164]: watch videos and participate in real-time chat with top scientists
"Clue to chemistry of heredity found" [165]. The New York Times. Saturday, June 13, 1953. The first American
newspaper coverage of the discovery of the DNA structure.
An Introduction to DNA and Chromosomes [166] from HOPES: Huntington's Disease Outreach Project for
DNA sequence

A DNA sequence or genetic sequence is a succession of letters representing the primary
structure of a real or hypothetical DNA molecule or strand. The sequence has capacity to carry
information. Sequences can be read from the biological raw material through DNA sequencing
methods.

AAACAACTTCGTAAGTATA

Electropherogram printout from automated sequencer for determining part of a DNA sequence

The possible letters are A, C, G, and T, representing the four nucleotide bases of a DNA strand (adenine,
cytosine, guanine, thymine) covalently linked to a phosphodiester backbone. In the typical case, the sequences
are printed abutting one another without gaps, as in the sequence AAAGTCTGAC, read left to right in the
5' to 3' direction. With regard to transcription, a
sequence is on the coding strand if it has the same order as the transcribed RNA.

One sequence can be complementary to another sequence, meaning that the base at each position is the
complement of the corresponding base (i.e. A to T, C to G) and the order is reversed. For example, the complementary
sequence to TTAC is GTAA. If one strand of the double-stranded DNA is considered the sense strand, then the other
strand, considered the antisense strand, will have the complementary sequence to the sense strand.
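
The complement-and-reverse rule can be written as a short function. This is a minimal sketch for the four unambiguous letters only; the name `reverse_complement` is chosen for illustration.

```python
# Translation table for the pairing rules A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the complementary strand, read in its own 5' to 3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("TTAC"))  # GTAA, matching the example in the text
```

Applying the function twice returns the original sequence, since each strand is the complement of the other.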

In some special cases, letters besides A, T, C, and G are present in a sequence. These letters represent ambiguity. Of
all the molecules sampled, there is more than one kind of nucleotide at that position. The rules of the International
Union of Pure and Applied Chemistry (IUPAC) are as follows:

A = adenine
C = cytosine
G = guanine
T = thymine
R = G or A (purine)
Y = T or C (pyrimidine)
K = G or T (keto)
M = A or C (amino)
S = G or C (strong bonds)
W = A or T (weak bonds)
B = G, T or C (all but A)
D = G, A or T (all but C)
H = A, C or T (all but G)
V = G, C or A (all but T)
N = A, G, C or T (any)
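
One way to use these ambiguity codes in software is to expand each letter into the set of bases it stands for. The sketch below turns an ambiguous pattern into a regular expression; the helper names `iupac_to_regex` and `matches` are invented for this example, and the table simply restates the list above.

```python
import re

# Each IUPAC letter mapped to the bases it may represent (from the list above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT", "B": "GTC", "D": "GAT",
    "H": "ACT", "V": "GCA", "N": "AGCT",
}


def iupac_to_regex(pattern):
    """Build a regular expression with one character class per IUPAC letter."""
    return "".join(f"[{IUPAC[code]}]" for code in pattern)


def matches(pattern, sequence):
    """True if the concrete sequence is one of those the pattern allows."""
    return re.fullmatch(iupac_to_regex(pattern), sequence) is not None


print(matches("GAYR", "GATA"), matches("GAYR", "GAAA"))
```

Here `GAYR` accepts `GATA` because Y allows T and R allows A, but rejects `GAAA` because Y does not allow A.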
See also

• DNA 

• DNA sequencing 

• Single-nucleotide polymorphism (SNP) 

• Sequence alignment 

• Sequence analysis 

• Sequence motif 

External links 

• A bibliography on features, patterns, correlations in DNA and protein texts 
DNA sequencing

The term DNA sequencing refers to sequencing methods for determining the order of the nucleotide 
bases — adenine, guanine, cytosine, and thymine — in a molecule of DNA. 

Knowledge of DNA sequences has become indispensable for basic biological research, other research branches 
utilizing DNA sequencing, and in numerous applied fields such as diagnostics, biotechnology, forensic biology and
biological systematics. The advent of DNA sequencing has significantly accelerated biological research and 
discovery. The rapid speed of sequencing attained with modern DNA sequencing technology has been instrumental 
in the sequencing of the human genome, in the Human Genome Project. Related projects, often by scientific 
collaboration across continents, have generated the complete DNA sequences of many animal, plant, and microbial 
genomes. 

The first DNA sequences were obtained in the early 1970s by academic researchers using laborious methods based 
on two-dimensional chromatography. Following the development of dye-based sequencing methods with automated 
analysis, DNA sequencing has become easier and orders of magnitude faster.



History 

RNA sequencing was one of the earliest forms of nucleotide sequencing. The major landmark of
RNA sequencing is the sequence of the first complete gene and the complete
genome of Bacteriophage MS2, identified and published by Walter Fiers and his coworkers at the University of
Ghent (Ghent, Belgium), between 1972 [2] and 1976. [3]

DNA Sequence Trace

Prior to the development of rapid DNA sequencing methods in the early 1970s by Frederick Sanger at the University
of Cambridge, in England and Walter Gilbert and Allan Maxam at Harvard, [4] a number of laborious methods
were used. For instance, in 1973, Gilbert and Maxam reported the sequence of 24 basepairs using a method known
as wandering-spot analysis. [6]
The chain-termination method developed by Sanger and coworkers in 1977 soon became the method of choice,
owing to its relative ease and reliability.

Maxam-Gilbert sequencing 

In 1976-1977, Allan Maxam and Walter Gilbert developed a DNA sequencing method based on chemical
modification of DNA and subsequent cleavage at specific bases. [4] Although Maxam and Gilbert published their
chemical sequencing method two years after the ground-breaking paper of Sanger and Coulson on plus-minus
sequencing, Maxam-Gilbert sequencing rapidly became more popular, since purified DNA could be used
directly, while the initial Sanger method required that each read start be cloned for production of single-stranded
DNA. However, with the improvement of the chain-termination method (see below), Maxam-Gilbert sequencing has
fallen out of favour due to its technical complexity prohibiting its use in standard molecular biology kits, extensive
use of hazardous chemicals, and difficulties with scale-up.

The method requires radioactive labelling at one end and purification of the DNA fragment to be sequenced. 
Chemical treatment generates breaks at a small proportion of one or two of the four nucleotide bases in each of four 
reactions (G, A+G, C, C+T). Thus a series of labelled fragments is generated, from the radiolabeled end to the first 
"cut" site in each molecule. The fragments in the four reactions are arranged side by side in gel electrophoresis for 
size separation. To visualize the fragments, the gel is exposed to X-ray film for autoradiography, yielding a series of 
dark bands each corresponding to a radiolabeled DNA fragment, from which the sequence may be inferred. 
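
The way a sequence is inferred from the four reactions (G, A+G, C, C+T) can be sketched in code. This is a simplified model with assumed names (`simulate_lanes`, `read_maxam_gilbert`): each band is represented by the position of the base that was cleaved, counted from the labelled end, and band intensities, gel resolution and the exact fragment length are ignored.

```python
def simulate_lanes(sequence):
    """Return, for each reaction, the set of positions that would give a band."""
    lanes = {"G": set(), "A+G": set(), "C": set(), "C+T": set()}
    for pos, base in enumerate(sequence, start=1):
        if base in "GA":
            lanes["A+G"].add(pos)
        if base == "G":
            lanes["G"].add(pos)
        if base in "CT":
            lanes["C+T"].add(pos)
        if base == "C":
            lanes["C"].add(pos)
    return lanes


def read_maxam_gilbert(lanes):
    """Infer the sequence from which lanes show a band at each position."""
    bases = []
    for pos in sorted(set().union(*lanes.values())):
        if pos in lanes["G"]:
            bases.append("G")      # band in both G and A+G
        elif pos in lanes["A+G"]:
            bases.append("A")      # band in A+G only
        elif pos in lanes["C"]:
            bases.append("C")      # band in both C and C+T
        else:
            bases.append("T")      # band in C+T only
    return "".join(bases)


print(read_maxam_gilbert(simulate_lanes("GATTACA")))
```

The key point is that A and T are never read from a lane of their own: they are the positions where the combined lane shows a band and the single-base lane does not.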

Also sometimes known as "chemical sequencing", this method originated in the study of DNA-protein interactions 
(footprinting), nucleic acid structure and epigenetic modifications to DNA, and within these it still has important 
applications. 
Chain-termination methods




Because the chain-terminator method (or Sanger method after its 
developer Frederick Sanger) is more efficient and uses fewer toxic 
chemicals and lower amounts of radioactivity than the method of Maxam 
and Gilbert, it rapidly became the method of choice. The key principle of 
the Sanger method was the use of dideoxynucleotide triphosphates 
(ddNTPs) as DNA chain terminators. 

The classical chain-termination method requires a single-stranded DNA
template, a DNA primer, a DNA polymerase, radioactively or 
fluorescently labeled nucleotides, and modified nucleotides that 
terminate DNA strand elongation. The DNA sample is divided into four 
separate sequencing reactions, containing all four of the standard 
deoxynucleotides (dATP, dGTP, dCTP and dTTP) and the DNA 
polymerase. To each reaction is added only one of the four 
dideoxynucleotides (ddATP, ddGTP, ddCTP, or ddTTP) which are the 
chain-terminating nucleotides, lacking a 3'-OH group required for the 
formation of a phosphodiester bond between two nucleotides, thus 
terminating DNA strand extension and resulting in DNA fragments of 
varying length. 

The newly synthesized and labeled DNA fragments are heat denatured, 
and separated by size (with a resolution of just one nucleotide) by gel 
electrophoresis on a denaturing polyacrylamide-urea gel with each of the 
four reactions run in one of four individual lanes (lanes A, T, G, C); the 
DNA bands are then visualized by autoradiography or UV light, and the 

DNA sequence can be directly read off the X-ray film or gel image. In the image on the right, X-ray film was 
exposed to the gel, and the dark bands correspond to DNA fragments of different lengths. A dark band in a lane 
indicates a DNA fragment that is the result of chain termination after incorporation of a dideoxynucleotide (ddATP, 
ddGTP, ddCTP, or ddTTP). The relative positions of the different bands among the four lanes are then used to read 
(from bottom to top) the DNA sequence. 
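
Reading a four-lane gel from bottom to top can be modelled in a few lines. The sketch below is an idealised illustration with assumed names (`sanger_lanes`, `read_gel`): it takes the newly synthesised strand as input, records one fragment length for every position where the matching dideoxynucleotide could terminate the chain, and then reads the bands in order of increasing length, since the shortest fragments migrate furthest down the gel.

```python
def sanger_lanes(new_strand):
    """Map each lane (A, T, G, C) to the fragment lengths it would contain."""
    lanes = {base: [] for base in "ATGC"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read bands from shortest to longest fragment, i.e. bottom to top."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = sanger_lanes("GATC")
print(lanes)
print(read_gel(lanes))
```

Because every fragment length occurs in exactly one lane, sorting the bands by length recovers the synthesised strand in its 5' to 3' order.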




Part of a radioactively labelled sequencing gel 
Technical variations of chain-termination sequencing include tagging
with nucleotides containing radioactive phosphorus for radiolabelling,
or using a primer labeled at the 5' end with a fluorescent dye.
Dye-primer sequencing facilitates reading in an optical system for
faster and more economical analysis and automation. The later
development by Leroy Hood and coworkers of fluorescently
labeled ddNTPs and primers set the stage for automated,
high-throughput DNA sequencing.



DNA fragments are labeled with a radioactive or
fluorescent tag on the primer (1), in the new DNA
strand with a labeled dNTP, or with a labeled
ddNTP.



Chain-termination methods have greatly simplified DNA sequencing. 
For example, chain-termination-based kits are commercially available 
that contain the reagents needed for sequencing, pre-aliquoted and 
ready to use. Limitations include non-specific binding of the primer to 
the DNA, affecting accurate read-out of the DNA sequence, and DNA 
secondary structures affecting the fidelity of the sequence. 
Sequence ladder by radioactive sequencing
compared to fluorescent peaks

Dye-terminator sequencing




Dye-terminator sequencing utilizes labelling of the chain terminator
ddNTPs, which permits sequencing in a single reaction, rather than
four reactions as in the labelled-primer method. In dye-terminator
sequencing, each of the four dideoxynucleotide chain terminators is
labelled with a fluorescent dye, and each dye fluoresces at a different
wavelength. Owing to its greater
expediency and speed, dye-terminator sequencing is now the mainstay
in automated sequencing. Its limitations include dye effects due to
differences in the incorporation of the dye-labelled chain terminators
into the DNA fragment, resulting in unequal peak heights and shapes in the electronic DNA sequence trace
chromatogram after capillary electrophoresis (see figure to the left). This problem has been addressed with the use of
modified DNA polymerase enzyme systems and dyes that minimize incorporation variability, as well as methods for
eliminating "dye blobs". The dye-terminator sequencing method, along with automated high-throughput DNA
sequence analyzers, is now being used for the vast majority of sequencing projects.

Capillary electrophoresis



Challenges 

Common challenges of DNA sequencing include poor quality in the first 15-40 bases of the sequence and
deteriorating quality of sequencing traces after 700-900 bases. Base calling software typically gives an estimate of
quality to aid in quality trimming.

In cases where DNA fragments are cloned before sequencing, the resulting sequence may contain parts of the
cloning vector. In contrast, PCR-based cloning and emerging sequencing technologies based on pyrosequencing
often avoid using cloning vectors.




Automation and sample preparation 

Automated DNA-sequencing instruments (DNA sequencers) can sequence up to 384 DNA samples in a single batch (run)
and perform up to 24 runs a day. DNA sequencers carry out capillary electrophoresis for size separation, detection
and recording of dye fluorescence, and data output as fluorescent peak trace chromatograms. Sequencing reactions by
thermocycling, cleanup and resuspension in a buffer solution before loading onto the sequencer are performed
separately. A number of commercial and non-commercial software packages can trim low-quality DNA traces
automatically. These programs score the quality of each peak and remove low-quality base peaks (generally located at
the ends of the sequence). Such algorithms are less accurate than visual examination by a human operator, but
accurate enough for automated processing of large sequence data sets.
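The end-trimming idea described above can be sketched in a few lines of Python. This is only an illustration, not the algorithm of any particular package: the function name `trim_ends`, the example read, and the cutoff of 20 are assumptions. The quality values are taken to be Phred-style scores, where a score Q corresponds to an estimated error probability of 10^(-Q/10).

```python
def trim_ends(bases, quals, min_q=20):
    """Strip low-quality bases from both ends of a read.

    bases: string of called bases
    quals: per-base quality scores (assumed Phred-style), same length as bases
    min_q: hypothetical cutoff; bases below it are trimmed from the ends only
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    start, end = 0, len(bases)
    # Advance from the left until the first acceptable base.
    while start < end and quals[start] < min_q:
        start += 1
    # Retreat from the right until the last acceptable base.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end]


read = "NNACGTACGTNN"
quals = [5, 8, 30, 35, 40, 40, 38, 36, 33, 25, 9, 4]
print(trim_ends(read, quals))  # ACGTACGT
```

Real trimming tools usually look at windows of bases rather than single positions, which makes them less sensitive to one isolated good or bad call.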



View of the start of an example dye-terminator read

Large-scale sequencing strategies

Current methods can directly sequence only relatively short (300-1000 nucleotides long) DNA fragments in a single
reaction.[12] The main obstacle to sequencing DNA fragments above this size limit is insufficient power of
separation for resolving large DNA fragments that differ in length by only one nucleotide.

Large-scale sequencing aims at sequencing very long DNA pieces, such as whole chromosomes. Common approaches consist
of cutting (with restriction enzymes) or shearing (with mechanical forces) large DNA fragments into shorter DNA
fragments. The fragmented DNA is cloned into a DNA vector and amplified in Escherichia coli. Short DNA fragments
purified from individual bacterial colonies are individually sequenced and assembled electronically into one long,
contiguous sequence. This method does not require any pre-existing information about the sequence of the DNA and is
referred to as de novo sequencing. Gaps in the assembled sequence may be filled by primer walking. The different
strategies have different tradeoffs in speed and accuracy; shotgun methods are often used for sequencing large
genomes, but their assembly is complex and difficult, particularly because sequence repeats often cause gaps in
genome assembly.
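The electronic assembly step can be illustrated with a toy greedy overlap merger. This is a minimal sketch under strong assumptions (error-free reads, all from the same strand, no repeats); production assemblers are far more elaborate. The function names and the three example reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps: the remaining reads stay separate (a gap)
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
```

When two reads share no sufficient overlap, the function returns more than one contig, which corresponds to the gaps that primer walking is used to close. Repeated sequences break the greedy choice, which is one reason real assembly is hard.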



[Figure. Steps shown: clone into vectors; transform bacteria, grow, isolate vector DNA; sequence the library;
assemble contiguous fragments. Caption: Genomic DNA is fragmented into random pieces and cloned as a bacterial
library. DNA from individual bacterial clones is sequenced and the sequence is assembled by using overlapping DNA
regions.]



New sequencing methods 



High-throughput sequencing 



The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies that
parallelize the sequencing process, producing thousands or millions of sequences at once.[13][14] High-throughput
sequencing technologies are intended to lower the cost of DNA sequencing beyond what is possible with standard
dye-terminator methods.[15]



In vitro clonal amplification 

Most sequencing approaches use an in vitro cloning step to amplify individual DNA molecules, because their
molecular detection methods are not sensitive enough for single-molecule sequencing. Emulsion PCR isolates
individual DNA molecules along with primer-coated beads in aqueous droplets within an oil phase. Polymerase
chain reaction (PCR) then coats each bead with clonal copies of the DNA molecule, followed by immobilization for
later sequencing. Emulsion PCR is used in the methods by Margulies et al. (commercialized by 454 Life Sciences),
Shendure and Porreca et al. (also known as "Polony sequencing") and SOLiD sequencing (developed by Agencourt,
now Applied Biosystems).[16][17][18] Another method for in vitro clonal amplification is bridge PCR, where
fragments are amplified upon primers attached to a solid surface, used in the Illumina Genome Analyzer. The
single-molecule method developed by Stephen Quake's laboratory (later commercialized by Helicos) is an exception:
it uses bright fluorophores and laser excitation to detect base addition events from individual DNA molecules
fixed to a surface, eliminating the need for molecular amplification.

Parallelized sequencing

DNA molecules are physically bound to a surface and sequenced in parallel. Sequencing by synthesis, like
dye-termination electrophoretic sequencing, uses a DNA polymerase to determine the base sequence. Reversible
terminator methods (used by Illumina and Helicos) use reversible versions of dye-terminators: they add one nucleotide
at a time, detect fluorescence at each position in real time, and then remove the blocking group to allow
polymerization of another nucleotide. Pyrosequencing (used by 454) also uses DNA polymerization, adding one
nucleotide species at a time and detecting and quantifying the number of nucleotides added to a given location
through the light emitted by the release of attached pyrophosphates.[16]
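The "one nucleotide species at a time" principle of pyrosequencing can be pictured with an idealized simulation. The sketch below assumes a noise-free light signal exactly proportional to the number of incorporated nucleotides, and for simplicity takes as input the strand being synthesized rather than its complementary template. The flow order "TACG" and the function names are illustrative assumptions, not a description of any instrument.

```python
def flowgram(synthesized, flow_order="TACG", max_flows=100):
    """Return a list of (nucleotide, signal) pairs for an idealized run.

    Each flow offers one nucleotide species; the signal is the number of
    consecutive positions that incorporate it (0 if none).
    """
    signals = []
    pos = 0
    flows = 0
    while pos < len(synthesized) and flows < max_flows:
        base = flow_order[flows % len(flow_order)]
        run = 0
        # A homopolymer run is incorporated in a single flow.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        flows += 1
    return signals


def decode(signals):
    """Rebuild the sequence from (nucleotide, signal) pairs."""
    return "".join(base * count for base, count in signals)


run = flowgram("TTAGGGC")
print(run)
print(decode(run))  # TTAGGGC
```

In practice the signal does not scale perfectly with run length, so long homopolymer runs are a typical source of error for this kind of chemistry.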

Sequencing by ligation 

Sequencing by ligation uses a DNA ligase to determine the target sequence.[17][18][21] Used in the polony method
and in the SOLiD technology, it uses a pool of all possible oligonucleotides of a fixed length, labeled according to
the sequenced position. Oligonucleotides are annealed and ligated; the preferential ligation by DNA ligase for
matching sequences results in a signal informative of the nucleotide at that position.

Microfluidic Sanger sequencing 

In microfluidic Sanger sequencing, the entire thermocycling amplification of DNA fragments as well as their
separation by electrophoresis is done on a single glass wafer (approximately 10 cm in diameter), thus reducing
reagent usage as well as cost. In some instances researchers have shown that they can increase the throughput of
conventional sequencing through the use of microchips. Further research is still needed to make this use of
technology effective.

Other sequencing technologies 

Sequencing by hybridization is a non-enzymatic method that uses a DNA microarray. A single pool of DNA whose
sequence is to be determined is fluorescently labeled and hybridized to an array containing known sequences. Strong
hybridization signals from a given spot on the array identify its sequence in the DNA being sequenced. Mass
spectrometry may be used to determine mass differences between DNA fragments produced in chain-termination
reactions.[23]

DNA sequencing methods currently under development include labeling the DNA polymerase,[17] reading the
sequence as a DNA strand transits through nanopores,[25][26] and microscopy-based techniques, such as AFM or
electron microscopy, that are used to identify the positions of individual nucleotides within long DNA fragments
(>5,000 bp) by nucleotide labeling with heavier elements (e.g., halogens) for visual detection and recording.

In October 2006, the X Prize Foundation established an initiative to promote the development of full genome
sequencing technologies, called the Archon X Prize, intending to award $10 million to "the first Team that can build
a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one
error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a
recurring cost of no more than $10,000 (US) per genome."[28]

Major landmarks in DNA sequencing

• 1953 Discovery of the structure of the DNA double helix. 

• 1972 Development of recombinant DNA technology, which permits isolation of defined fragments of DNA; prior 
to this, the only accessible samples for sequencing were from bacteriophage or virus DNA. 

• 1975 The first complete DNA genome to be sequenced is that of bacteriophage φX174.

• 1977 Allan Maxam and Walter Gilbert publish "DNA sequencing by chemical degradation". Frederick Sanger, 
independently, publishes "DNA sequencing by enzymatic synthesis". 

• 1980 Frederick Sanger and Walter Gilbert receive the Nobel Prize in Chemistry.

• 1984 Medical Research Council scientists decipher the complete DNA sequence of the Epstein-Barr virus, 170 
kb. 

• 1986 Leroy E. Hood's laboratory at the California Institute of Technology and Smith announce the first 
semi-automated DNA sequencing machine. 

• 1987 Applied Biosystems markets the first automated sequencing machine, the model ABI 370.

• 1990 The U.S. National Institutes of Health (NIH) begins large-scale sequencing trials on Mycoplasma 
capricolum, Escherichia coli, Caenorhabditis elegans, and Saccharomyces cerevisiae (at 75 cents (US)/base). 

• 1995 Craig Venter, Hamilton Smith, and colleagues at The Institute for Genomic Research (TIGR) publish the 
first complete genome of a free-living organism, the bacterium Haemophilus influenzae. The circular 
chromosome contains 1,830,137 bases and its publication in the journal Science marks the first use of 
whole-genome shotgun sequencing, eliminating the need for initial mapping efforts. 

• 1995 Richard Mathies et al. publish fluorescence energy transfer dye-based sequencing.[13]

• 1996 Pål Nyrén and his student Mostafa Ronaghi at the Royal Institute of Technology in Stockholm publish their
method of pyrosequencing.[20]

• 1998 Phil Green and Brent Ewing of the University of Washington publish "phred" for sequencer data analysis. 



See also 

• Full genome sequencing 

• Genome project 

• Cancer genome sequencing 

• Single Molecule Real Time Sequencing 

• Applied Biosystems 

• 454 Life Sciences 

• Illumina (company) 

• Complete Genomics 

• Joint Genome Institute 

• DNA field-effect transistor 

• DNA sequencing theory

External links

• Disruptive Gene Sequencing technology - Single Molecule Real Time (SMRT) sequencing

DNA profiling

DNA profiling (also called DNA testing, DNA typing, or genetic fingerprinting) is a technique employed by 
forensic scientists to assist in the identification of individuals on the basis of their respective DNA profiles. DNA 
profiles are encrypted sets of numbers that reflect a person's DNA makeup, which can also be used as the person's 
identifier. DNA profiling should not be confused with full genome sequencing. It is used in, for example, parental
testing and rape investigation. 

Although 99.9% of human DNA sequences are the same in every person, enough of the DNA is different to
distinguish one individual from another.[2] DNA profiling uses repetitive ("repeat") sequences that are highly
variable,[2] called variable number tandem repeats (VNTR). VNTR loci are very similar between closely related
humans, but so variable that unrelated individuals are extremely unlikely to have the same VNTRs.

The DNA profiling technique was first reported in 1984[3] by Sir Alec Jeffreys at the University of Leicester in
England, and is now the basis of several national DNA databases. Jeffreys's genetic fingerprinting was
made commercially available in 1987, when a chemical company, ICI, started a blood-testing center in England.

DNA profiling process 

The process begins with a sample of an individual's DNA (typically called a "reference sample"). The most desirable 
method of collecting a reference sample is the use of a buccal swab, as this reduces the possibility of contamination. 
When this is not available (e.g. because a court order may be needed and not obtainable) other methods may need to 
be used to collect a sample of blood, saliva, semen, or other appropriate fluid or tissue from personal items (e.g. 
toothbrush, razor, etc.) or from stored samples (e.g. banked sperm or biopsy tissue). Samples obtained from blood
relatives (biological relative) can provide an indication of an individual's profile, as could human remains which had 
been previously profiled. 

A reference sample is then analyzed to create the individual's DNA profile using one of a number of techniques,
discussed below. The DNA profile is then compared against another sample to determine whether there is a genetic
match.

RFLP analysis

The first methods used for DNA profiling involved restriction enzyme digestion, followed by Southern blot analysis.
Although polymorphisms can exist in the restriction enzyme cleavage sites, more commonly the enzymes and DNA probes
were used to analyze VNTR loci. However, the Southern blot technique is laborious and requires large amounts of
undegraded sample DNA. Also, Jeffreys's original technique looked at many minisatellite loci at the same time,
increasing the observed variability but making it hard to discern individual alleles (and thereby precluding parental
testing). These early techniques have been supplanted by PCR-based assays.




Variations of VNTR allele lengths in 6 individuals. 



PCR analysis 

With the invention of the polymerase chain reaction (PCR) technique, DNA profiling took huge strides forward in 
both discriminating power and the ability to recover information from very small (or degraded) starting samples. 
PCR greatly amplifies the amounts of a specific region of DNA, using oligonucleotide primers and a thermostable 
DNA polymerase. Early assays such as the HLA-DQ alpha reverse dot blot strips grew to be very popular due to 
their ease of use, and the speed with which a result could be obtained. However, they were not as discriminating as
RFLP. It was also difficult to determine a DNA profile for mixed samples, such as a vaginal swab from a sexual 
assault victim. 

Fortunately, the PCR method is readily adaptable for analyzing VNTR loci. In the United States the FBI has 
standardized a set of 13 VNTR assays for DNA typing, and has organized the CODIS database for forensic 
identification in criminal cases. Similar assays and databases have been set up in other countries. Also, commercial 
kits are available that analyze single-nucleotide polymorphisms (SNPs). These kits use PCR to amplify specific 
regions with known variations and hybridize them to probes anchored on cards, which results in a colored spot 
corresponding to the particular sequence variation. 

STR analysis 

The method of DNA profiling used today is based on PCR and uses short tandem repeats (STR). This method uses 
highly polymorphic regions that have short repeated sequences of DNA (the most common is 4 bases repeated, but 
there are other lengths in use, including 3 and 5 bases). Because different unrelated people have different numbers of 
repeat units, these regions of DNA can be used to discriminate between unrelated individuals. These STR loci 
(locations) are targeted with sequence-specific primers and are amplified using PCR. The DNA fragments that result
are then separated and detected using electrophoresis. There are two common methods of separation and detection, 
capillary electrophoresis (CE) and gel electrophoresis. 
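To make the notion of "number of repeat units" concrete, the sketch below counts back-to-back copies of a motif in a sequence string. It is only an illustration: laboratories infer repeat number from amplified fragment length after electrophoresis, not from string matching, and the motif "GATA", the example sequence and the function name are hypothetical.

```python
import re


def longest_tandem_run(sequence, motif):
    """Largest number of consecutive copies of motif found in sequence."""
    # Match maximal runs of one or more adjacent copies of the motif.
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Hypothetical allele: flanking bases around four copies of a 4-base repeat.
allele = "CC" + "GATA" * 4 + "TT" + "GATA"
print(longest_tandem_run(allele, "GATA"))  # 4
```

Two people who carry different repeat counts at the same locus yield PCR products of different lengths, which is what the separation step detects.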

The polymorphisms displayed at each STR region are by themselves very common; typically, each polymorphism
will be shared by around 5-20% of individuals. When looking at multiple loci, it is the combination of these
polymorphisms, unique to an individual, that makes this method discriminating as an identification tool. The more STR
regions that are tested in an individual, the more discriminating the test becomes.

From country to country, different STR-based DNA-profiling systems are in use. In North America, systems which
amplify the CODIS 13 core loci are almost universal, while in the UK the SGM+ system, which is compatible with
The National DNA Database, is in use. Whichever system is used, many of the STR regions under test are the same.
These DNA-profiling systems are based on multiplex reactions, whereby many STR regions are tested at the
same time.

The true power of STR analysis is in its statistical power of discrimination. Because the 13 loci that are currently 
used for discrimination in CODIS are independently assorted (having a certain number of repeats at one locus doesn't 
change the likelihood of having any number of repeats at any other locus), the product rule for probabilities can be 
applied. This means that if someone has the DNA type of ABC, where the three loci were independent, we can say 
that the probability of having that DNA type is the probability of having type A times the probability of having type 
B times the probability of having type C. This has resulted in the ability to generate match probabilities of 1 in a 
quintillion (1 with 18 zeros after it) or more. 
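The product rule described above amounts to multiplying the per-locus frequencies. The following sketch uses made-up frequencies (every locus set to 10%, within the 5-20% range mentioned earlier) purely to show the arithmetic; the function name is an assumption, and real casework uses population-specific allele frequency tables.

```python
from math import prod


def random_match_probability(locus_frequencies):
    """Multiply per-locus profile frequencies, assuming independent loci."""
    for f in locus_frequencies:
        if not 0 < f <= 1:
            raise ValueError("each frequency must be in (0, 1]")
    return prod(locus_frequencies)


# Hypothetical example: 13 independent loci, each type shared by 10% of people.
rmp = random_match_probability([0.1] * 13)
print(f"about 1 in {1 / rmp:.3g}")
```

The result is only as good as the independence assumption: for relatives, and especially for identical twins, the per-locus probabilities are not independent draws from the population.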

However, DNA database searches have shown false DNA matches much more frequently than expected, including one
perfect 13-locus match out of only 30,000 DNA samples in Maryland in January 2007. Moreover, since there are
about 12 million monozygotic twins on Earth, that theoretical probability is of limited practical use. For example,
the actual probability that two random persons have the same DNA depends on whether there were twins or triplets
(etc.) in the family, and on the number of loci used in the test. Where twins are common, the probability of matching
DNA is 22 in 1,000, or about 2.2 in 100.

In practice, the risk of a match caused by contamination is much greater than the risk of matching a distant
relative; a sample may be contaminated by nearby objects, or by left-over cells transferred from a prior test.
Logically, the risk is greatest for matching the person most common in the samples: everything collected from, or in
contact with, a victim is a major source of contamination for any other samples brought into a lab. For that reason,
multiple control samples, prepared during the same period as the actual test samples, are typically tested to ensure
that they stayed clean. Unexpected matches (or variations) in several control samples indicate a high probability of
contamination for the actual test samples. In a relationship test, the full DNA profiles should differ (except for
twins), to prove that a person wasn't actually matched as being related to their own DNA in another sample.

AmpFLP 

Another technique, AmpFLP, or amplified fragment length polymorphism, was also put into practice during the early
1990s. This technique was faster than RFLP analysis and used PCR to amplify DNA samples. It relied on
variable number tandem repeat (VNTR) polymorphisms to distinguish various alleles, which were separated on a
polyacrylamide gel using an allelic ladder (as opposed to a molecular weight ladder). Bands could be visualized by
silver staining the gel. One popular locus for fingerprinting was the D1S80 locus. As with all PCR-based methods,
highly degraded DNA or very small amounts of DNA may cause allelic dropout (causing a mistake in thinking a
heterozygote is a homozygote) or other stochastic effects. In addition, because the analysis is done on a gel,
alleles with very high repeat numbers may bunch together at the top of the gel, making them difficult to resolve.
AmpFLP analysis can be highly automated, and allows for easy creation of phylogenetic trees based on comparing
individual samples of DNA. Due to its relatively low cost and ease of set-up and operation, AmpFLP remains popular in
lower income countries.

DNA family relationship analysis

Using PCR technology, DNA analysis is widely applied to determine genetic family relationships such as paternity, 
maternity, siblingship and other kinships. 

During conception, the father's sperm cell and the mother's egg cell, each containing half the amount of DNA found 
in other body cells, meet and fuse to form a fertilized egg, called a zygote. The zygote contains a complete set of 
DNA molecules, a unique combination of DNA from both parents. This zygote divides and multiplies into an 
embryo and later, a full human being. 

DNA does not change once it is formed at conception. At each stage of development, all the cells forming the body
contain the same DNA, half from the father and half from the mother. This fact allows relationship testing to
use all types of samples, including loose cells from the cheeks collected using buccal swabs, blood or other types
of samples.

While a lot of DNA contains information for a certain function, there is some called junk DNA, which is currently 
used for human identification. At some special locations (called loci) in the junk DNA, predictable inheritance 
patterns were found to be useful in determining biological relationships. These locations contain specific DNA 
markers that DNA scientists use to identify individuals. In a routine DNA paternity test, the markers used are Short 
Tandem Repeats (STRs), short pieces of DNA that occur in highly differential repeat patterns among individuals. 

Each person's DNA contains two copies of these markers — one copy inherited from the father and one from the 
mother. Within a population, the markers at each person's DNA location could differ in length and sometimes 
sequence, depending on the markers inherited from the parents. 

The combination of marker sizes found in each person makes up his/her unique genetic profile. When determining 
the relationship between two individuals, their genetic profiles are compared to see if they share the same inheritance 
patterns at a statistically conclusive rate. 

For example, the following sample report from the commercial DNA paternity testing laboratory Universal Genetics[7]
shows how relatedness between parents and child is identified on those special markers:



DNA Marker    Mother      Child     Alleged Father
D21S11        28, 30      28, 31    29, 31
D7S820        9, 10       10, 11    11, 12
TH01          14, 15      14, 16    15, 16
D13S317       7, 8        7, 9      8, 9
D19S433       14, 16.2    14, 15    15, 17



The partial results indicate that the child and the alleged father's DNA match among these five markers. The 
complete test results show this correlation on 16 markers between the child and the tested man to draw a conclusion 
of whether or not the man is the biological father. 
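The matching logic behind the five markers in the sample report can be checked mechanically: at each marker, one of the child's alleles must be found in the mother and the other in the alleged father. The sketch below applies that rule to the values from the report; the function name is an assumption, alleles are kept as strings so that variants such as "16.2" are compared exactly, and mutation is ignored.

```python
def paternal_allele_candidates(mother, child, alleged_father):
    """Child alleles that the alleged father could have contributed,
    given that the mother supplied the child's other allele."""
    c1, c2 = child
    candidates = set()
    if c1 in mother and c2 in alleged_father:
        candidates.add(c2)
    if c2 in mother and c1 in alleged_father:
        candidates.add(c1)
    return candidates


# (mother, child, alleged father) alleles from the sample report.
report = {
    "D21S11": (("28", "30"), ("28", "31"), ("29", "31")),
    "D7S820": (("9", "10"), ("10", "11"), ("11", "12")),
    "TH01": (("14", "15"), ("14", "16"), ("15", "16")),
    "D13S317": (("7", "8"), ("7", "9"), ("8", "9")),
    "D19S433": (("14", "16.2"), ("14", "15"), ("15", "17")),
}

for marker, (mother, child, father) in report.items():
    shared = paternal_allele_candidates(mother, child, father)
    print(marker, "consistent" if shared else "inconsistent", sorted(shared))
```

All five markers come out consistent, in line with the statement that the child and the alleged father match on them. An empty result at a marker would be an inconsistency, although laboratories do not usually exclude paternity on a single marker because of possible mutation.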

Scientifically, each marker is assigned a Paternity Index (PI), which is a statistical measure of how strongly a
match at a particular marker indicates paternity. The PIs of the individual markers are multiplied together to
generate the Combined Paternity Index (CPI), which indicates how much more likely the tested man is to be the
biological father of the tested child than a random man from the entire population of the same race. The CPI is then
converted into a Probability of Paternity showing the degree of relatedness between the alleged father and child.
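The arithmetic from PI to CPI to Probability of Paternity can be sketched as follows. The text does not state the conversion formula; the one used here is the common Bayesian form with a neutral prior of 0.5, which is an assumption, as are the function names and the PI values.

```python
from math import prod


def combined_paternity_index(paternity_indices):
    """CPI is the product of the per-marker Paternity Index values."""
    return prod(paternity_indices)


def probability_of_paternity(cpi, prior=0.5):
    """Convert a CPI into a probability using Bayes' rule.

    prior is the assumed probability of paternity before testing;
    0.5 is a conventional neutral choice, not a value from the text.
    """
    return (cpi * prior) / (cpi * prior + (1 - prior))


# Hypothetical per-marker PI values.
cpi = combined_paternity_index([2.0, 5.0, 10.0])
print(cpi)                              # 100.0
print(probability_of_paternity(cpi))    # 100/101, roughly 0.990
```

With a prior of 0.5 the formula reduces to CPI / (CPI + 1), so a larger CPI pushes the probability closer to 1 without ever reaching it.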

The DNA test report in other family relationship tests, such as grandparentage and siblingship tests, is similar to a 
paternity test report. Instead of the Combined Paternity Index, a different value, such as a Siblingship Index, is 
reported. 

The report shows the genetic profiles of each tested person. If there are markers shared among the tested individuals,
the probability of biological relationship is calculated to determine how likely it is that the tested individuals
share the same markers due to a blood relationship.

Y-chromosome analysis

Recent innovations have included the creation of primers targeting polymorphic regions on the Y-chromosome 
(Y-STR), which allows resolution of a mixed DNA sample from a male and female and/or cases in which a 
differential extraction is not possible. Y-chromosomes are paternally inherited, so Y-STR analysis can help in the 
identification of paternally related males. Y-STR analysis was performed in the Sally Hemings controversy to 
determine if Thomas Jefferson had sired a son with one of his slaves. 

Mitochondrial analysis 

For highly degraded samples, it is sometimes impossible to get a complete profile of the 13 CODIS STRs. In these 
situations, mitochondrial DNA (mtDNA) is sometimes typed due to there being many copies of mtDNA in a cell, 
while there may only be 1-2 copies of the nuclear DNA. Forensic scientists amplify the HV1 and HV2 regions of the 
mtDNA, then sequence each region and compare single-nucleotide differences to a reference. Because mtDNA is 
maternally inherited, directly linked maternal relatives can be used as match references, such as one's maternal 
grandmother's daughter's son. A difference of two or more nucleotides is generally considered to be an exclusion. 
Heteroplasmy and poly-C differences may throw off straight sequence comparisons, so some expertise on the part of 
the analyst is required. mtDNA is useful in determining clear identities, such as those of missing persons when a 
maternally linked relative can be found. mtDNA testing was used in determining that Anna Anderson was not the 
Russian princess she had claimed to be, Anastasia Romanov. 
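The comparison rule stated above (two or more nucleotide differences generally count as an exclusion) can be written down directly. This is a simplified sketch: it assumes the two sequences are already aligned and of equal length, and it does not model heteroplasmy or poly-C length differences, which the text notes require expert judgment. The function names and example sequences are hypothetical.

```python
def count_differences(sample, reference):
    """Number of positions at which two aligned sequences differ."""
    if len(sample) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for s, r in zip(sample, reference) if s != r)


def interpret(sample, reference, exclusion_threshold=2):
    """Apply the 'two or more differences' rule of thumb from the text."""
    differences = count_differences(sample, reference)
    return "exclusion" if differences >= exclusion_threshold else "cannot exclude"


print(interpret("ACCTGATTAC", "ACCTGATTAC"))  # cannot exclude
print(interpret("ACCTGATTAC", "ACTTGACTAC"))  # exclusion
```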

mtDNA can be obtained from such material as hair shafts and old bones or teeth.

DNA databases 

There are now several DNA databases in existence around the world. Some are private, but most of the largest
databases are government controlled. The United States maintains the largest DNA database, with the Combined
DNA Index System holding over 5 million records as of 2007. The United Kingdom maintains the National DNA
Database (NDNAD), which is of similar size, despite the UK's smaller population. The size of this database, and its
rate of growth, is giving concern to civil liberties groups in the UK, where police have wide-ranging powers to take
samples and retain them even in the event of acquittal.

The USA PATRIOT Act provides a means for the U.S. government to get DNA samples from other
countries if they are either a division of, or head office of, a company operating in the U.S. Under the act, the
American offices of the company can't divulge to their subsidiaries/offices in other countries the reasons that these
DNA samples are sought or by whom.

When a match is made from a national DNA databank to link a crime scene to an offender who has provided a
DNA sample to a databank, that link is often referred to as a cold hit. A cold hit is of value in referring the police
agency to a specific suspect but is of less evidential value than a DNA match made from outside the DNA
databank.

Considerations when evaluating DNA evidence 

In the early days of the use of genetic fingerprinting as criminal evidence, juries were often swayed by spurious
statistical arguments by defense lawyers along these lines: given a match that had a 1 in 5 million probability of
occurring by chance, the lawyer would argue that this meant that in a country of, say, 60 million people there were 12
people who would also match the profile. This was then translated to a 1 in 12 chance of the suspect being the guilty
one. This argument is not sound unless the suspect was drawn at random from the population of the country. In fact,
a jury should consider how likely it is that an individual matching the genetic profile would also have been a suspect
in the case for other reasons. Another spurious statistical argument, known as the prosecutor's fallacy, is based on
the false assumption that a 1 in 5 million probability of a match automatically translates into a 1 in 5 million
probability of innocence.
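Both arguments can be examined with a small Bayes' rule calculation. The sketch below is an illustration under stated assumptions: the true source always matches, laboratory error is ignored, and the prior (the probability that the suspect is the source before the DNA evidence is considered) is supplied by the caller. The function names are hypothetical.

```python
def expected_matches(population, match_probability):
    """Expected number of people in the population who match by chance."""
    return population * match_probability


def posterior_source_probability(prior, match_probability):
    """P(suspect is the source | DNA match), by Bayes' rule.

    Assumes the true source matches with certainty and an unrelated
    person matches with probability match_probability.
    """
    return prior / (prior + (1 - prior) * match_probability)


population = 60_000_000
p_match = 1 / 5_000_000

print(expected_matches(population, p_match))  # about 12

# Suspect picked at random from the whole population (the defense argument):
print(posterior_source_probability(1 / population, p_match))  # about 0.077, roughly 1 in 13

# Other evidence already makes the suspect as likely as not to be the source:
print(posterior_source_probability(0.5, p_match))  # about 0.9999998
```

The first figure is close to the 1-in-12 value of the defense argument and holds only when nothing else points to the suspect. The second shows how strongly the conclusion depends on the other evidence. Neither number equals the 1 in 5 million match probability itself, which is the quantity the prosecutor's fallacy misreads as the probability of innocence.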

When using RFLP, the theoretical risk of a coincidental match is 1 in 100 billion (100,000,000,000), although the
practical risk is actually 1 in 1000 because monozygotic twins are 0.2% of the human population. Moreover, the rate
of laboratory error is almost certainly higher than this, and often actual laboratory procedures do not reflect the
theory under which the coincidence probabilities were computed. For example, the coincidence probabilities may be
calculated based on the probabilities that markers in two samples have bands in precisely the same location, but a
laboratory worker may conclude that similar (but not precisely identical) band patterns result from identical
genetic samples with some imperfection in the agarose gel. However, in this case, the laboratory worker increases
the coincidence risk by expanding the criteria for declaring a match. Recent studies have quoted relatively high error
rates which may be cause for concern.[11] In the early days of genetic fingerprinting, the necessary population data to
accurately compute a match probability was sometimes unavailable. Between 1992 and 1996, arbitrary low ceilings
were controversially put on match probabilities used in RFLP analysis rather than the higher theoretically computed
ones.[12] Today, RFLP has become widely disused due to the advent of more discriminating, sensitive and easier
technologies.

STRs do not suffer from such subjectivity and provide similar power of discrimination (1 in 10^13 for unrelated
individuals if using a full SGM+ profile). Figures of this magnitude are not considered to be
statistically supportable by scientists in the UK; for unrelated individuals with full matching DNA profiles, a match
probability of 1 in a billion is considered statistically supportable. (Since 1998, the DNA profiling system supported
by The National DNA Database in the UK has been the SGM+ DNA profiling system, which includes 10 STR regions and a
sex-indicating test.) However, with any DNA technique, the cautious juror should not convict on genetic fingerprint
evidence alone if other factors raise doubt. Contamination with other evidence (secondary transfer) is a key source of
incorrect DNA profiles, and raising doubts as to whether a sample has been adulterated is a favorite defense
technique. More rarely, chimerism is an instance where the lack of a genetic match may unfairly exclude a
suspect.

Evidence of genetic relationship 

It's also possible to use DNA profiling as evidence of genetic relationship, but testing that shows no relationship isn't
absolutely certain. While almost all individuals have a single and distinct set of genes, rare individuals, known as
"chimeras", have at least two different sets of genes. There have been several cases of DNA profiling that falsely
"proved" that a mother was unrelated to her children.[13]

Fake DNA evidence 

The value of DNA evidence has to be seen in light of recent cases where criminals planted fake DNA samples at 
crime scenes. In one case, a criminal even planted fake DNA evidence in his own body: Dr. John Schneeberger raped 
one of his sedated patients in 1992 and left semen on her underwear. Police drew what they believed to be 
Schneeberger's blood and compared its DNA against the crime scene semen DNA on three occasions, never showing 
a match. It turned out that he had surgically inserted a Penrose drain into his arm and filled it with foreign blood and 
anticoagulants. 

In a study conducted by the life science company Nucleix and published in the journal Forensic Science
International, scientists found that an in vitro synthesized sample of DNA matching any desired genetic profile can
be constructed using standard molecular biology techniques without obtaining any actual tissue from that person.

DNA evidence as evidence in criminal trials

Familial searching

Familial searching is the use of family members' DNA to identify a closely related suspect in jurisdictions where 
large DNA databases exist, but no exact match has been found. The first successful use of the practice was in a UK 
case where a man was convicted of manslaughter when he threw a brick stained with his own blood into a moving 
car. Police could not get an exact match to the UK's DNA database because the man had no criminal convictions, but 
police implicated him using a close relative's DNA.[14]

Surreptitious DNA collecting 

Police forces may collect DNA samples without the suspects' knowledge and use them as evidence. The legality of this
practice has been questioned in Australia.

In the United States, it has been accepted, with courts often holding that there was no expectation of privacy and citing
California v. Greenwood (1988), in which the Supreme Court held that the Fourth Amendment does not prohibit
the warrantless search and seizure of garbage left for collection outside the curtilage of a home. Critics of this
practice underline the fact that this analogy ignores that "most people have no idea that they risk surrendering their
genetic identity to the police by, for instance, failing to destroy a used coffee cup. Moreover, even if they do realize
it, there is no way to avoid abandoning one's DNA in public." [15]

In the UK, the Human Tissue Act of 2004 prohibited private individuals from covertly collecting biological samples 
(hair, fingernails, etc.) for DNA analysis, but excluded medical and criminal investigations from the offense. [16]

England and Wales 

Evidence from an expert who has compared DNA samples must be accompanied by evidence as to the sources of the
samples and the procedures for obtaining the DNA profiles. [17] The judge must ensure that the jury understands
the significance of DNA matches and mismatches in the profiles. The judge must also ensure that the jury does not
confuse the 'match probability' (the probability that a person that is chosen at random has a matching DNA profile to
the sample from the scene) with the 'likelihood ratio' (the probability that a person with matching DNA committed
the crime). In R v. Doheny, EWCA Crim 728, [18] Phillips LJ gave this example of a summing up, which should be
carefully tailored to the particular facts in each case:

Members of the Jury, if you accept the scientific evidence called by the Crown, this indicates that there 
are probably only four or five white males in the United Kingdom from whom that semen stain could 
have come. The Defendant is one of them. If that is the position, the decision you have to reach, on all 
the evidence, is whether you are sure that it was the Defendant who left that stain or whether it is 
possible that it was one of that other small group of men who share the same DNA characteristics. 
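
The arithmetic behind a direction of this kind can be sketched as follows. This is an illustrative calculation only: the match probability and population size are hypothetical numbers chosen to give "four or five" matching men, and the second function assumes that exactly one person in the population left the stain, that everyone is equally likely to be that person before any other evidence is considered, and that no other evidence exists.

```python
def expected_matching_people(match_probability: float, population: int) -> float:
    """Expected number of people in the population whose profile matches by chance."""
    return match_probability * population


def probability_defendant_is_source(match_probability: float, population: int) -> float:
    """Probability that a matching defendant left the sample, on DNA evidence alone.

    The true source always matches; each of the other (population - 1) people
    matches by chance with probability match_probability.
    """
    chance_matches = match_probability * (population - 1)
    return 1.0 / (1.0 + chance_matches)


# Hypothetical figures for illustration only.
p = 1 / 6_500_000   # assumed match probability
n = 26_000_000      # assumed number of possible sources
print(expected_matching_people(p, n))
print(probability_defendant_is_source(p, n))
```

With these assumed figures about four people are expected to match, so a match probability of one in several million does not by itself mean the defendant is almost certainly the source; the remaining evidence in the case has to distinguish him from the other matching men.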

Juries should weigh up conflicting and corroborative evidence, using their own common sense and not by using
mathematical formulae, such as Bayes' theorem, so as to avoid "confusion, misunderstanding and misjudgment". [19]

Presentation and evaluation of evidence of partial or incomplete DNA profiles 

In R v Bates (2006) EWCA Crim 1395, Moore-Bick LJ said:

"We can see no reason why partial profile DNA evidence should not be admissible provided that the jury are 
made aware of its inherent limitations and are given a sufficient explanation to enable them to evaluate it. 
There may be cases where the match probability in relation to all the samples tested is so great that the judge 
would consider its probative value to be minimal and decide to exclude the evidence in the exercise of his 
discretion, but this gives rise to no new question of principle and can be left for decision on a case by case 
basis. However, the fact that there exists in the case of all partial profile evidence the possibility that a 
"missing" allele might exculpate the accused altogether does not provide sufficient grounds for rejecting such
evidence. In many cases there is a possibility (at least in theory) that evidence exists which would assist the accused
and perhaps even exculpate him altogether, but that does not provide grounds for excluding relevant evidence
that is available and otherwise admissible, though it does make it important to ensure that the jury are given
sufficient information to enable them to evaluate that evidence properly." [20]

DNA testing in the US 

There are state laws on DNA profiling in all 50 states of the United States. [21] Detailed information on database laws
in each state can be found at the National Conference of State Legislatures website.

Development of artificial DNA 

In August 2009, scientists in Israel raised serious questions concerning law enforcement's use of DNA matching as
the ultimate method of identification. In a paper published in the journal
Forensic Science International: Genetics, the Israeli researchers demonstrated that it is possible to manufacture
DNA in a laboratory, and thus falsify DNA evidence. The scientists fabricated saliva and blood samples that
contained DNA from a person other than the ostensible donor of the blood and saliva. [22]

The same researchers also showed that, using a DNA database, it is possible
to take information from a profile and manufacture DNA to match it. This can be done without access to
any actual DNA from the person whose DNA is being duplicated. The synthetic DNA oligos required for the
procedure are used in probably every molecular biology laboratory. [22]

Dr. Daniel Frumkin, lead author on the paper, was quoted in The New York Times as saying, "You can just engineer a 
crime scene... any biology undergraduate could perform this." 

Dr. Frumkin has developed a test that can forensically differentiate real DNA samples from fake ones. His
test uses epigenetic modifications, in particular, DNA methylation. Seventy percent of the DNA in any human
genome is methylated, meaning it contains methyl group modifications within a CpG dinucleotide context.
Methylation at the promoter region is associated with gene silencing. It appears that the synthetic DNA lacks this
epigenetic modification, which allows the test to be used to distinguish manufactured DNA from original, genuine
DNA. [23]
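
The principle of such a test can be illustrated with a toy classifier. This is not the published assay: the input format (one True/False methylation call per reference CpG site that is expected to be methylated in genuine human DNA) and the 0.1 threshold are assumptions made purely for illustration.

```python
def methylated_fraction(calls: list[bool]) -> float:
    """Fraction of examined CpG sites that were found to be methylated."""
    if not calls:
        raise ValueError("at least one CpG site must be examined")
    return sum(calls) / len(calls)


def looks_synthetic(calls: list[bool], threshold: float = 0.1) -> bool:
    """Flag a sample whose reference sites are almost entirely unmethylated.

    The threshold is a hypothetical cut-off, not a validated forensic value.
    """
    return methylated_fraction(calls) < threshold


# Hypothetical methylation calls at 20 reference sites.
in_vitro_sample = [False] * 20
natural_sample = [True] * 14 + [False] * 6
print(looks_synthetic(in_vitro_sample), looks_synthetic(natural_sample))
```

DNA amplified in vitro is copied without the cell's methylation machinery, which is why an unmethylated pattern at normally methylated sites is a plausible warning sign.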

The idea that DNA can be fabricated, and then planted at a crime scene, is now reality, and Dr. Frumkin has
developed the test to differentiate real from fake DNA. But it is unknown how many, if any, police departments
currently use the test, which is a concern given Frumkin's claim that the DNA manufacturing procedure is
within the grasp of any undergraduate biology student. No police lab has publicly announced that it is using the new
test to verify DNA results, while FSI Genetics says that any forensic laboratory doing DNA identification should
adopt this test to authenticate its results as "real" DNA. [24]



Cases 

• In the 1950s, Anna Anderson claimed that she was Grand Duchess Anastasia Nikolaevna of Russia; in the 1980s 
after her death, samples of her tissue that had been stored at a Charlottesville, Virginia hospital following a 
medical procedure were tested using DNA fingerprinting and showed that she bore no relation to the 
Romanovs. [25]

• In 1986, Richard Buckland was exonerated despite having admitted to the rape and murder of a teenager near
Leicester, the city where DNA profiling was first developed. This was the first use of DNA fingerprinting in a
criminal investigation. [26]

• In 1987, genetic fingerprinting was used in a criminal court for the first time, in the trial of a man accused of
unlawful intercourse with a mentally handicapped 14-year-old female who gave birth to his baby. [27]

• In 1987, in the same case as Buckland, British baker Colin Pitchfork was the first criminal caught and convicted
using DNA fingerprinting. [28]

• In 1987, Florida rapist Tommy Lee Andrews was the first person in the United States to be convicted as a result
of DNA evidence, for raping a woman during a burglary; he was convicted on 6 November 1987 and sentenced to
22 years in prison. [29] [30]

• In 1988, Timothy Spencer was the first man in Virginia to be sentenced to death through DNA testing, for several
rape and murder charges; he was dubbed "The South Side Strangler" because he killed victims on the south side of
Richmond, Virginia. He was convicted of rape and first-degree murder, sentenced to death, and
executed on April 27, 1994. David Vasquez, initially convicted of one of Spencer's crimes, became the first man
in America exonerated based on DNA evidence.

• In 1989, Chicago man Gary Dotson was the first person whose conviction was overturned using DNA evidence. 

• In 1991, Allan Legere was the first Canadian to be convicted as a result of DNA evidence, for four murders he 
had committed while an escaped prisoner in 1989. During his trial, his defense argued that the relatively shallow 
gene pool of the region could lead to false positives. 

• In 1992, DNA evidence was used to prove that Nazi doctor Josef Mengele was buried in Brazil under the name 
Wolfgang Gerhard. 

• In 1993, Kirk Bloodsworth was the first person to have been convicted of murder and sentenced to death, whose 
conviction was overturned using DNA evidence. 

• The 1993 rape and murder of Mia Zapata, lead singer for the Seattle punk band The Gits, was unsolved nine years
after the murder. A database search in 2001 failed, but the killer's DNA was collected when he was arrested in
Florida for burglary and domestic abuse in 2002.

• The science was made famous in the United States in 1994 when prosecutors heavily relied on — and through 
expert witnesses exhaustively presented and explained — DNA evidence allegedly linking O.J. Simpson to a 
double murder. The case also brought to light the laboratory difficulties and handling procedure mishaps which 
can cause such evidence to be significantly doubted. 

• In 1994, Royal Canadian Mounted Police (RCMP) detectives successfully tested hairs from a cat known as 
Snowball, and used the test to link a man to the murder of his wife, thus marking for the first time in forensic 
history the use of non-human DNA to identify a criminal. 

• In 1998, Dr. Richard J. Schmidt was convicted of attempted second-degree murder when it was shown that there 
was a link between the viral DNA of the human immunodeficiency virus (HIV) he had been accused of injecting 
in his girlfriend and viral DNA from one of his patients with full-blown AIDS. This was the first time viral DNA 
fingerprinting had been used as evidence in a criminal trial. 

• In 1999, Raymond Easton, a disabled man from Swindon, England, was arrested and detained for 7 hours in
connection with a burglary due to an inaccurate DNA match. His DNA had been retained on file after an
unrelated domestic incident some time previously.

• In May 2000 Gordon Graham murdered Paul Gault at his home in Lisburn, Northern Ireland. Graham was 
convicted of the murder when his DNA was found on a sports bag left in the house as part of an elaborate ploy to 
suggest the murder occurred after a burglary had gone wrong. Graham was having an affair with the victim's wife
at the time of the murder. It was the first time Low Copy Number DNA was used in Northern 
Ireland.Media:http://www.belfasttelegraph^ 

• In 2001, Wayne Butler was convicted for the murder of Celia Douty. It was the first murder in Australia to be solved
using DNA profiling. [32] [33]

• In 2002, DNA testing was used to exonerate Douglas Echols, a man who was wrongfully convicted in a 1986 rape 
case. Echols was the 114th person to be exonerated through post-conviction DNA testing. 

• In August 2002, Annalisa Vincenzi was shot dead in Tuscany. Bartender Peter Hamkin, 23, was
arrested in Merseyside in March 2003 on an extradition warrant heard at Bow Street Magistrates' Court in London
to establish whether he should be taken to Italy to face a murder charge. DNA "proved" he shot her, but he was
cleared on other evidence.

• In 2003, Welshman Jeffrey Gafoor was convicted of the 1988 murder of Lynette White, when crime scene 
evidence collected 12 years earlier was re-examined using STR techniques, resulting in a match with his 
nephew. This may be the first known example of the DNA of an innocent yet related individual being used to 
identify the actual criminal, via "familial searching". 

• In June 2003, because of new DNA evidence, Dennis Halstead, John Kogut and John Restivo won a re-trial on 
their murder conviction. The three men had already served eighteen years of their thirty-plus-year sentences. 

• The trial of Robert Pickton is notable in that DNA evidence is being used primarily to identify the victims, and in 
many cases to prove their existence. 

• In March 2003, Josiah Sutton was released from prison after serving four years of a twelve-year sentence for a 
sexual assault charge. Questionable DNA samples taken from Sutton were retested in the wake of the Houston 
Police Department's crime lab scandal of mishandling DNA evidence. 

• In 2004, DNA testing shed new light into the mysterious 1912 disappearance of Bobby Dunbar, a four-year-old 
boy who vanished during a fishing trip. He was allegedly found alive eight months later in the custody of William 
Cantwell Walters, but another woman claimed that the boy was her son, Bruce Anderson, whom she had entrusted 
in Walters' custody. The courts disbelieved her claim and convicted Walters for the kidnapping. The boy was 
raised and known as Bobby Dunbar throughout the rest of his life. However, DNA tests on Dunbar's son and 
nephew revealed the two were not related, thus establishing that the boy found in 1912 was not Bobby Dunbar, 
whose real fate remains unknown. [36]

• In 2005, Gary Leiterman was convicted of the 1969 murder of Jane Mixer, a law student at the University of 
Michigan, after DNA found on Mixer's pantyhose was matched to Leiterman. DNA in a drop of blood on Mixer's 
hand was matched to John Ruelas, who was only four years old in 1969 and was never successfully connected to 
the case in any other way. Leiterman's defense unsuccessfully argued that the unexplained match of the blood spot 
to Ruelas pointed to cross-contamination and raised doubts about the reliability of the lab's identification of 
Leiterman. [37][38][39]

• In December 2005, Robert Clark was proven innocent of a 1981 attack on an Atlanta woman after serving
twenty-four years in prison. Clark is the 164th person in the United States and the fifth in Georgia to be freed
using post-conviction DNA testing.

• In March 2009, the conviction of Sean Hodgson, who had spent 27 years in jail for killing Teresa De Simone, 22,
in her car in Southampton 30 years earlier, was quashed by senior judges. Tests proved that DNA from the scene
was not his. British police have since reopened the case.



See also 

• capillary electrophoresis (CE) 

• Forensic identification 

• Full genome sequencing 

• Gene mapping 

• genealogical DNA test 

• Harvey v. Horan 

• Identification (biology) 

• National DNA database 

• Parental testing 

• Phantom of Heilbronn 

• Project Innocence 

• restriction fragment length polymorphism (RFLP) 

• ribotyping 

• short tandem repeat (STR) 

• State of Louisiana v. Frisard



External links 

• Eureka moment that led to the discovery of DNA fingerprinting [40]

• EHSTRAFD.org - Earth Human STR Allele Frequencies Database [41]

• Why A DNA Test should include the Mothers DNA [42]

• How to Choose a Qualified DNA Laboratory [43]

• HaploGroups.com [44] The most comprehensive resource for DNA testing to help you find your genetic origin or
Haplogroup

• How to make a DNA (genetic) Fingerprint and applications of DNA Fingerprinting [45]

• Create a DNA Fingerprint [46]

• Fingerprinting.com [47] DNA Fingerprinting Identification and Methods

• Forensic genetics and ethical, legal and social implications beyond the clinic [48]

• In silico simulation of Molecular Biology Techniques [49] - A place to learn typing techniques by simulating them

• The History of DNA Testing [50] Interactive presentation uncovering the historical origins of DNA testing and
highlighting famous paternity testing cases (requires Adobe Flash)

• National DNA-Databases in the EU (Nathan Van Camp & Kris Dierickx) [51]

• NPIA NDNAD Website [52]

• The Evaluation of Forensic DNA Evidence [53]

• Key Dates in the History of DNA Profiling [54]

DNA polymerase



A DNA polymerase is an enzyme that catalyzes the polymerization of 
deoxyribonucleotides into a DNA strand. DNA polymerases are 
best-known for their role in DNA replication, in which the polymerase 
"reads" an intact DNA strand as a template and uses it to synthesize the 
new strand. This process copies a piece of DNA. The 
newly-polymerized molecule is complementary to the template strand 
and identical to the template's original partner strand. DNA 
polymerases use a magnesium ion for catalytic activity. 




3D structure of the DNA-binding helix-turn-helix 
motifs in human DNA polymerase beta 



Function 



DNA polymerase can add free nucleotides to only the 3' end of the
newly-forming strand. This results in elongation of the new strand in a
5'-3' direction. No known DNA polymerase is able to begin a new
chain (de novo). DNA polymerase can add a nucleotide onto only a
preexisting 3'-OH group, and, therefore, needs a primer at which it can
add the first nucleotide. Primers consist of RNA and DNA bases with
the first two bases always being RNA, and are synthesized by another
enzyme called primase. An enzyme known as a helicase is required to
unwind DNA from a double-strand structure to a single-strand
structure to facilitate replication of each strand consistent with the
semiconservative model of DNA replication.
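
The primer requirement and the direction of synthesis can be illustrated with a small sketch. It is a deliberate simplification: the template is written 3' to 5', the new strand is written 5' to 3', and the primer is written in DNA letters for brevity although real primers begin with RNA. The function name `synthesize` is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesize(template_3to5: str, primer_5to3: str) -> str:
    """Extend a primer along a template, adding bases only at the 3' end."""
    if not primer_5to3:
        raise ValueError("a polymerase cannot start a chain without a primer")
    if len(primer_5to3) > len(template_3to5):
        raise ValueError("primer is longer than the template")
    # The primer must already be base-paired with the start of the template.
    for template_base, primer_base in zip(template_3to5, primer_5to3):
        if COMPLEMENT[template_base] != primer_base:
            raise ValueError("primer is not complementary to the template")
    new_strand = primer_5to3
    # Each new base is appended to the 3' end, so the strand grows 5' to 3'.
    for template_base in template_3to5[len(primer_5to3):]:
        new_strand += COMPLEMENT[template_base]
    return new_strand


print(synthesize("TACGGA", "AT"))
```

Because bases are only ever appended to the existing 3' end, the function has nothing to extend when the primer is empty, which mirrors the inability of DNA polymerases to start a chain de novo.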

Error correction is a property of some, but not all, DNA polymerases.
This process corrects mistakes in newly-synthesized DNA. When an
incorrect base pair is recognized, DNA polymerase reverses its
direction by one base pair of DNA. The 3'-5' exonuclease activity of
the enzyme allows the incorrect base pair to be excised (this activity is
known as proofreading). Following base excision, the polymerase can
re-insert the correct base and replication can continue.
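
A single proofreading step can be sketched in the same simplified notation (template written 3' to 5', growing strand written 5' to 3'). The function name and the way a misincorporated base is passed in are hypothetical; the sketch only models the back-up, excise, and re-insert sequence described above.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def add_base_with_proofreading(template_3to5: str, strand_5to3: str, incoming: str) -> str:
    """Add one base at the 3' end and correct it if it mispairs with the template."""
    position = len(strand_5to3)
    if position >= len(template_3to5):
        raise ValueError("the template has already been fully copied")
    correct = COMPLEMENT[template_3to5[position]]
    strand = strand_5to3 + incoming      # polymerase adds the incoming base
    if incoming != correct:
        strand = strand[:-1]             # 3'-5' exonuclease excises the mismatch
        strand += correct                # polymerase re-inserts the correct base
    return strand


print(add_base_with_proofreading("TACG", "ATG", "A"))
```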

Various DNA polymerases are extensively used in molecular biology 
experiments. 






Variation across species 




DNA polymerase with proofreading ability 



DNA polymerases have highly conserved structure, which means that
their overall catalytic subunits vary, on the whole, very little from species to species. Conserved structures usually
indicate important, irreplaceable functions of the cell, the maintenance of which provides evolutionary advantages.

Some viruses also encode special DNA polymerases, such as Hepatitis B virus DNA polymerase. These may
selectively replicate viral DNA through a variety of mechanisms. Retroviruses encode an unusual DNA polymerase 
called reverse transcriptase, which is an RNA-dependent DNA polymerase (RdDp). It polymerizes DNA from a 
template of RNA. 

DNA polymerase families 

Based on sequence homology, DNA polymerases can be further subdivided into seven different families: A, B, C, D, 
X, Y, and RT. 

Family A 

Family A polymerases include both replicative and repair polymerases. Replicative members from this family include the
extensively-studied T7 DNA polymerase, as well as the eukaryotic mitochondrial DNA polymerase γ. Among the
repair polymerases are Escherichia coli DNA pol I, Thermus aquaticus pol I, and Bacillus stearothermophilus pol I.
These repair polymerases are involved in excision repair and processing of Okazaki fragments generated during
lagging strand synthesis.

Family B 

Family B polymerases are mostly replicative polymerases and include the major eukaryotic DNA polymerases α, δ, ε,
and also DNA polymerase ζ. Family B also includes DNA polymerases encoded by some bacteria and
bacteriophages, of which the best-characterized are from T4, Phi29, and RB69 bacteriophages. These enzymes are
involved in both leading and lagging strand synthesis during replication. A hallmark of the B family of polymerases
is their highly faithful DNA synthesis during replication. While many have an intrinsic 3'-5' proofreading
exonuclease activity, eukaryotic DNA polymerases α and ζ are two examples of B family polymerases lacking this
proofreading activity.

Family C 

Family C polymerases are the primary bacterial chromosomal replicative enzymes. DNA Polymerase III alpha subunit from E.
coli is the catalytic subunit and possesses no known nuclease activity. A separate subunit, the epsilon subunit,
possesses the 3'-5' exonuclease activity used for editing during chromosomal replication. Recent research has
classified Family C polymerases as a subcategory of Family X.

Family D 

Family D polymerases are still not very well characterized. All known examples are found in the Euryarchaeota subdomain of
Archaea and are thought to be replicative polymerases. 

Family X 

Family X contains the well-known eukaryotic polymerase Pol β, as well as other eukaryotic polymerases such as Pol σ, Pol λ,
Pol μ, and terminal deoxynucleotidyl transferase (TdT). Pol β is required for short-patch base excision repair, a DNA
repair pathway that is essential for repairing abasic sites. Pol λ and Pol μ are involved in non-homologous
end-joining, a mechanism for rejoining DNA double-strand breaks. TdT is expressed only in lymphoid tissue, and
adds "n nucleotides" to double-strand breaks formed during V(D)J recombination to promote immunological
diversity. The yeast Saccharomyces cerevisiae has only one Pol X polymerase, Pol4, which is involved in
non-homologous end-joining.

Family Y

Family Y polymerases differ from others in having a low fidelity on undamaged templates and in their ability to replicate
through damaged DNA. Members of this family are hence called translesion synthesis (TLS) polymerases.
Depending on the lesion, TLS polymerases can bypass the damage in an error-free or error-prone fashion, the latter
resulting in elevated mutagenesis. Xeroderma pigmentosum variant (XPV) patients for instance have mutations in
the gene encoding Pol η (eta), which is error-free for UV-lesions. In XPV patients, alternative error-prone
polymerases, e.g., Pol ζ (zeta) (polymerase ζ is a B Family polymerase, a complex of the catalytic subunit REV3L
with Rev7, which associates with Rev1), are thought to be involved in mistakes that result in the cancer
predisposition of these patients. Other members in humans are Pol ι (iota), Pol κ (kappa), and Rev1 (terminal
deoxycytidyl transferase). In E. coli, two TLS polymerases, Pol IV (DINB) and Pol V (UmuD'2C), are known.

Family RT 

The reverse transcriptase family contains examples from both retroviruses and eukaryotic polymerases. The 
eukaryotic polymerases are usually restricted to telomerases. These polymerases use an RNA template to synthesize 
the DNA strand. 

Prokaryotic DNA polymerases 

Bacteria have 5 known DNA polymerases: 

• Pol I: implicated in DNA repair; has 5'->3' (Polymerase) activity and both 3'->5' exonuclease (Proofreading) and 
5'->3' exonuclease activity (RNA Primer removal). 

• Pol II: involved in repair of damaged DNA; has 3'->5' exonuclease activity.

• Pol III: the main polymerase in bacteria (elongates in DNA replication); has 3'->5' exonuclease proofreading 
ability. 

• Pol IV: a Y-family DNA polymerase. 

• Pol V: a Y-family DNA polymerase; participates in bypassing DNA damage. 

Eukaryotic DNA polymerases 

Eukaryotes have at least 15 DNA polymerases:

• Pol α (synonyms are RNA primase, DNA polymerase): forms a complex with a small catalytic (PriS) and a large
noncatalytic (PriL) subunit, with the Pri subunits acting as a primase (synthesizing an RNA primer), and then
with DNA Pol α elongating that primer with DNA nucleotides. After around 20 nucleotides, elongation is taken
over by Pol ε (on the leading strand) and Pol δ (on the lagging strand).

• Pol β: implicated in repairing DNA, in base excision repair and gap-filling synthesis.

• Pol γ: replicates and repairs mitochondrial DNA and has proofreading 3'->5' exonuclease activity.

• Pol δ: highly processive and has proofreading 3'->5' exonuclease activity. Thought to be the main polymerase
involved in lagging strand synthesis, though there is still debate about its role.

• Pol ε: also highly processive and has proofreading 3'->5' exonuclease activity. Highly related to Pol δ, and
thought to be the main polymerase involved in leading strand synthesis, [7] though there is again still debate about
its role.

• Pol η, Pol ι, Pol κ, and Rev1 are Y-family DNA polymerases and Pol ζ is a B-family DNA polymerase. These polymerases
are involved in the bypass of DNA damage. [8]

• There are also other eukaryotic polymerases known, which are not as well characterized: θ, λ, φ, σ, and μ.

None of the eukaryotic polymerases can remove primers (5'->3' exonuclease activity); that function is carried out by
other enzymes. Only the polymerases that deal with the elongation (γ, δ and ε) have proofreading ability (3'->5'
exonuclease).

See also

• Polymerase chain reaction 

• RNA polymerase 



External links 

• Burgers P, Koonin E, Bruford E et al. (2001). "Eukaryotic DNA polymerases: proposal for a revised 
nomenclature". J. Biol Chem. 276 (47): 43487-90. doi:10.1074/jbc.R100056200. PMID 11579108. 

• PDB Molecule of the Month pdb 3_1 [9]

• Unusual repair mechanism in DNA polymerase lambda [10], Ohio State University, July 25, 2006.

• MeSH DNA+polymerases [11]

• EC 2.7.7.7 [12]

• A great animation of DNA Polymerase from WEHI [13]

DNA Topoisomerase



Topoisomerases (type I: EC 5.99.1.2 [2], type II: EC 5.99.1.3 [3]) are
enzymes that unwind and wind DNA, in order for DNA to control the
synthesis of proteins, and to facilitate DNA replication. The enzyme is
necessary due to inherent problems caused by the DNA's double
helix. The structure of DNA is a double-stranded helix, where the four
bases, adenine, thymine, guanine, and cytosine, are paired and
stored in the center of this helix. While this structure provides a stable
means of storing the genetic code, Watson and Crick noted that the two
strands of DNA are intertwined and this would require the two strands
to be untwisted in order to access the information stored. However, they
also foresaw that there would be some mechanism to overcome this
problem.

To help overcome these problems caused by the double helix,
topoisomerases bind to either single-stranded or double-stranded DNA
and cut the phosphate backbone of the DNA. This intermediate break
allows the DNA to be untangled or unwound, and at the end of these
processes, the DNA is reconnected again. Since the overall chemical
composition and connectivity of the DNA does not change, the tangled
and untangled DNAs are chemical isomers, differing only in their
global topology, thus their name. Topoisomerases are isomerase
enzymes that act on the topology of DNA. [4]

Topoisomerase I solves the problem caused by
tension generated by winding/unwinding of
DNA. It wraps around DNA and makes a cut
permitting the helix to spin. Once DNA is
relaxed, topoisomerase reconnects broken strands
(PDB 1a36).




Discovery 

The need for this enzyme was recognized long before it was discovered. When the double-helical nature of DNA 
was determined by Watson and Crick, the authors noted that there must be some mechanism that would resolve the 
tangles that arise from this structural feature. The enzyme, originally termed gyrase, was first discovered by Harvard 
Professor James C. Wang. [5]

Function 

The double-helical configuration that DNA strands naturally reside in makes them difficult to separate, and yet they 
must be separated by helicase proteins if other enzymes are to transcribe the sequences that encode proteins, or if 
chromosomes are to be replicated. In so-called circular DNA, in which double helical DNA is bent around and 
joined in a circle, the two strands are topologically linked, or knotted. Otherwise identical loops of DNA having 
different numbers of twists are topoisomers, and cannot be interconverted by any process that does not involve the 
breaking of DNA strands. Topoisomerases catalyze and guide the unknotting of DNA by creating transient breaks in 
the DNA using a conserved tyrosine as the catalytic residue.

The insertion of viral DNA into chromosomes and other forms of recombination can also require the action of 
topoisomerases. 

Clinical significance

See also topoisomerase inhibitor 

Many drugs operate through interference with the topoisomerases. The broad-spectrum fluoroquinolone antibiotics
act by disrupting the function of bacterial type II topoisomerases.

Some chemotherapy drugs work by interfering with topoisomerases in cancer cells:

• Type I is inhibited by irinotecan and topotecan.

• Type II is inhibited by etoposide (VP-16), teniposide and HU-331, a quinone synthesized from cannabidiol.

Topoisomerase I is the antigen recognized by Anti Scl-70 antibodies in scleroderma.

These small molecule inhibitors act as efficient anti-bacterial and anti-cancer agents by hijacking the natural ability 
of topoisomerase to create breaks in chromosomal DNA. These breaks in DNA accumulate, ultimately leading to 
programmed cell death, or apoptosis. 

Topological problems 

There are three main types of topology: supercoiling, knotting and catenation. Outside of the essential processes of
replication or transcription, DNA needs to be kept as compact as possible and these three states help this cause.
However, when transcription or replication occurs, DNA needs to be free and these states seriously hinder the
processes. In addition, during replication, the newly replicated duplex of DNA and the original duplex of DNA
become intertwined and need to be completely separated in order to ensure genomic integrity as a cell divides. As a
transcription bubble proceeds, DNA ahead of the transcription bubble becomes overwound, or positively supercoiled,
while DNA behind the transcription bubble becomes underwound, or negatively supercoiled. As replication occurs,
DNA ahead of the replication fork becomes positively supercoiled, while DNA behind the replication fork
becomes entangled, forming precatenanes. One of the most essential topological problems occurs at the very end of
replication, when daughter chromosomes must be fully disentangled before mitosis occurs. Topoisomerase IIA plays
an essential role in resolving these topological problems.

Classes 

Topoisomerases can fix these topological problems and are divided into two types by the number of
strands cut in one round of action. Both these classes of enzyme utilize a conserved tyrosine. However, these
enzymes are structurally and mechanistically different.

• Type I topoisomerase cuts one strand of a DNA double helix, relaxation occurs, and then the cut strand is 
reannealed. Type I topoisomerases are subdivided into two subclasses: type IA topoisomerases which share many 
structural and mechanistic features with the type II topoisomerases, and type IB topoisomerases, which utilize a 
controlled rotary mechanism. Examples of type IA topoisomerases include topo I and topo III. Historically, type 
IB topoisomerases were referred to as eukaryotic topo I, but IB topoisomerases are present in all three domains of 
life. Interestingly, type IA topoisomerases form a covalent intermediate with the 5' end of DNA, while the IB 
topoisomerases form a covalent intermediate with the 3' end of DNA. Recently, a type IC topoisomerase has been 
identified, called topo V. While it is structurally unique from type IA and IB topoisomerases, it shares a similar 
mechanism with type IB topoisomerase. 

• Type II topoisomerase cuts both strands of one DNA double helix, passes another unbroken DNA helix through 
it, and then reanneals the cut strands. It is also split into two subclasses: type IIA and type IIB topoisomerases, 
which share similar structure and mechanisms. Examples of type IIA topoisomerases include eukaryotic topo II, 
E. coli gyrase, and E. coli topo IV. Examples of type IIB topoisomerases include topo VI. 

Both type I and type II topoisomerases change the linking number of DNA. Type IA topoisomerases change the 
linking number by one, type IB and type IC topoisomerases change the linking number by any integer, while type 
IIA and type IIB topoisomerases change the linking number by two. 
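
The linking-number rules for the topoisomerase classes can be illustrated with a minimal Python sketch. The table `LK_STEP` and the function `events_needed` are hypothetical names introduced here for illustration only. The sketch treats each catalytic event as a pure change in linking number and ignores the direction preferences and energetics of real enzymes.

```python
# Change in linking number (Lk) per catalytic event, as stated in the text.
# None stands for "any integer" (types IB and IC).
LK_STEP = {"IA": 1, "IB": None, "IC": None, "IIA": 2, "IIB": 2}


def events_needed(topo_type: str, lk_start: int, lk_target: int) -> int:
    """Minimum number of catalytic events to move Lk from lk_start to lk_target.

    Raises ValueError when the target cannot be reached in whole steps
    (for example, an odd change in Lk with a type II enzyme).
    """
    step = LK_STEP[topo_type]
    delta = abs(lk_target - lk_start)
    if delta == 0:
        return 0
    if step is None:
        return 1  # a single event may change Lk by any integer
    if delta % step != 0:
        raise ValueError(f"type {topo_type} cannot change Lk by {delta}")
    return delta // step


if __name__ == "__main__":
    # Relax a hypothetical plasmid from Lk = 100 to Lk = 94.
    for topo in ("IA", "IB", "IIA"):
        print(topo, events_needed(topo, 100, 94))
```

Under these assumptions, a change of six in linking number takes six type IA events, three type II events, or a single type IB event, while an odd change is not reachable by a type II enzyme alone.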

Further reading 

• James C. Wang (2009) Untangling the Double Helix. DNA Entanglement and the Action of the DNA 
Topoisomerases, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, 2009. 245 pp. ISBN 
9780879698799 

See also 

• DNA topology 

• Supercoil 

• TOP1 

• Type II topoisomerase 

External links 

• MeSH DNA+Topoisomerases 

List of nucleic acid simulation software 

This is a list of computer programs that are used for nucleic acid simulations. 

Min - Optimization. MD - Molecular dynamics. MC - Monte Carlo. Crt - Cartesian coordinates. 
Int - Internal coordinates. Exp - Explicit water. Imp - Implicit water. Lig - Ligand interactions. 
HA - Hardware accelerated. A "+" marks a supported feature and "-" an empty cell. 

Name          View 3D  Model Build  Min  MD  MC  Crt  Int  Exp  Imp  Lig  HA  Comments                         License     Homepage
Abalone       +        +            +    +   -   +    -    +    +    +    -   DNA, proteins, ligands           Free        Agile Molecule [1]
AMBER [2]     -        +            +    +   -   +    -    +    +    +    -   AMBER Force Field                Commercial  ambermd.org [3]
CHARMM        -        +            +    +   +   +    -    +    +    +    -   CHARMM Force Field               Commercial  charmm.org [4]
ICM [5]       +        +            +    -   +   -    +    -    +    -    -   Global optimization              Commercial  Molsoft [6]
JUMNA [7]     -        +            +    -   -   -    +    -    +    -    -   -                                Commercial  -
MDynaMix [8]  +        +            -    +   -   +    -    +    -    +    -   Common MD                        GPL         Stockholm University [9]
MOE           +        +            +    +   -   +    -    +    -    +    -   Molecular Operating Environment  Commercial  Chemical Computing Group [10]
NAB [11]      -        +            -    -   -   -    -    -    -    -    -   Nucleic Acid Builder             GPL         New Jersey University [12]
NAMD          +        -            +    +   -   +    -    +    -    +    +   NAnoscale Molecular Dynamics     Free        University of Illinois [13]



See also 

• Molecular Modelling 

• Molecular modelling on GPU 

• Molecular graphics 

• Molecular mechanics 

• Molecular dynamics 

• Molecular design software 

• Molecule editor 

• Quantum chemistry computer programs 

• List of molecular graphics systems 

• List of protein structure prediction software 

• List of RNA structure prediction software 

• List of software for molecular mechanics modeling 

• List of software for nanostructures modeling 

• Force field 

• Force field implementation 

DNA Structure and Functions 



DNA structure 



Nucleic acid structure refers to the structure of nucleic acids such as DNA and RNA. It is often divided into four 
different levels: 

• Primary structure: the raw sequence of nucleobases of each of the component DNA strands; 

• Secondary structure: the set of interactions between bases, i.e., which parts of which strands are bound to each 
other; 

• Tertiary structure: the locations of the atoms in three-dimensional space, taking into consideration geometrical 
and steric constraints; and 

• Quaternary structure: the higher-level organization of DNA in chromatin, or the interactions between separate 
RNA units in the ribosome or spliceosome. 

See also 

• Nucleic acid structure determination (experimental) 

• Nucleic acid structure prediction (computational) 

• Nucleic acid design 

• DNA nanotechnology 

• Nucleic acid thermodynamics 

• DNA supercoil 

Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid 



Discovery of the DNA Double Helix 

Figure: An early sketch of the DNA double helix. 

People: William Astbury, Francis Crick, Jerry Donohue, Phoebus Levene, Erwin Schrödinger, James Watson, 
Oswald Avery, Erwin Chargaff, Rosalind Franklin, Linus Pauling, Alec Stokes, Maurice Wilkins. 



"Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid" was an article published by 
James D. Watson and Francis Crick in the scientific journal Nature, volume 171, pages 737-738 (dated April 
25, 1953). [1] It was the first publication to describe the double helix structure of DNA. This 
discovery had a major impact on genetics in particular and biology in general. 

This article is often termed a "pearl" of science because it is brief and contains the answer to a fundamental mystery 
about living organisms. This mystery was the question of how it was possible that genetic instructions were held 
inside organisms and how they were passed from generation to generation. The article presents a simple and elegant 
solution, which surprised many biologists at the time who felt that DNA transmission was going to be more difficult 
to detail and understand. 

The nature of the discovery 



Watson and Crick's 1953 article contains the answer to a fundamental 
mystery about living organisms. The nature of their discovery was 
distinctive and in some ways unsurprising. What is hidden in the 
technical jargon of the title is that Watson and Crick's discovery of the 
chemical structure of DNA finally revealed how genetic instructions 
are stored inside organisms and passed from generation to generation. 

Origins of molecular biology 

The application of physics and chemistry to biological problems led to 
the development of molecular biology. Not all biology that concerns 
molecules falls into the category that is labelled "molecular biology". 
Molecular biology is particularly concerned with the flow and 
consequences of biological information at the level of genes and 
proteins. The discovery of the DNA double helix made clear that genes 
are functionally defined parts of DNA molecules and that there must be 
a way for cells to make use of their DNA genes in order to make 
proteins. 



Figure 2. Diagrammatic representation of the key structural features of the DNA double helix, with the bases 
adenine, thymine, guanine and cytosine labelled. This figure does not depict B-DNA. 



Linus Pauling was a chemist who was very influential in developing an 
understanding of the structure of biological molecules. In 1951, Pauling published the structure of the alpha helix, a 
fundamentally important structural component of proteins. Early in 1953 Pauling published an incorrect triple helix 
model of DNA. Both Crick, and particularly Watson, felt that they were racing against Pauling to discover the 
structure of DNA. 

Max Delbrück was a physicist who recognized some of the biological implications of quantum physics. Delbrück's 
thinking about the physical basis of life stimulated Erwin Schrödinger to write the highly influential book What Is 
Life? Schrödinger's book was an important influence on Francis Crick, James D. Watson, and Maurice Wilkins, who 
won the Nobel Prize in Physiology or Medicine for the discovery of the DNA double helix. Delbrück's efforts to 
promote the "Phage Group" (exploring genetics by way of the viruses that infect bacteria) were important in the 
early development of molecular biology in general and the development of Watson's scientific interests in particular. 

DNA structure and function 

It is not always the case that the structure of a molecule is easy to relate to its function. What makes the structure of 
DNA so obviously related to its function was described modestly at the end of the article: "It has not escaped our 
notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the 
genetic material". 

The "specific pairing" is a key feature of the Watson and Crick model 
of DNA, the pairing of nucleotide subunits. In DNA, the amount of 
guanine is equal to cytosine and the amount of adenine is equal to 
thymine. The A:T and C:G pairs are structurally similar. In particular, 
the length of each base pair is the same and they fit equally between 
the two phosphate backbones (Figure 2). The base pairs are held 
together by hydrogen bonds, a type of chemical attraction that is easy 
to break and easy to reform. After realizing the structural similarity of 
the A:T and C:G pairs, Watson and Crick soon produced their double 
helix model of DNA with the hydrogen bonds at the core of the helix 
providing a way to unzip the two complementary strands for easy 
replication: the last key requirement for a likely model of the genetic 
molecule. 

Indeed, the base-pairing did suggest a way to copy a DNA molecule. 
Just pull apart the two phosphate backbones, each with its hydrogen 
bonded A, T, G, and C components. Each strand could then be used as 
a template for assembly of a new base-pair complementary strand. 
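
The copying mechanism described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: strands are plain strings over A, T, G and C, the function names `complement_strand`, `replicate` and `base_counts` are hypothetical, and each new strand is written in the opposite direction to its template to reflect the antiparallel arrangement of the two chains.

```python
from collections import Counter

# Watson-Crick pairing: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(template: str) -> str:
    """Return the base-pair complementary strand, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(template.upper()))


def replicate(strand: str):
    """Pull the duplex apart and use each strand as a template for a new partner."""
    partner = complement_strand(strand)
    daughter_1 = (strand, complement_strand(strand))
    daughter_2 = (partner, complement_strand(partner))
    return daughter_1, daughter_2


def base_counts(duplex) -> Counter:
    """Count the bases over both strands of a duplex."""
    return Counter("".join(duplex))


if __name__ == "__main__":
    d1, d2 = replicate("AAGC")
    print(d1, d2)
    print(base_counts(d1))
```

Both daughter duplexes contain the same pair of strands as the parent, and the base counts of any duplex built this way satisfy the equalities A = T and G = C mentioned in the text.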

Future considerations 

When Watson and Crick produced their double helix model of DNA, it 
was known that most of the specialized features of the many different 
life forms on Earth are made possible by proteins. Structurally, 
proteins are long chains of amino acid subunits. In some way, the 
genetic molecule, DNA, had to contain instructions for how to make 
the thousands of proteins found in cells. From the DNA double helix model, it was clear that there must be some 
correspondence between the linear sequences of nucleotides in DNA molecules and the linear sequences of amino 
acids in proteins. The details of how sequences of DNA instruct cells to make specific proteins were worked out by 
molecular biologists during the period from 1953 to 1965. Francis Crick played an integral role in both the theory 
and analysis of the experiments that led to an improved understanding of the genetic code. 
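
The correspondence between nucleotide sequences and amino acid sequences can be illustrated with a small Python sketch. It is a simplification made for illustration: it reads the coding DNA strand directly in triplets, skips the messenger RNA step, and includes only a handful of entries from the standard genetic code. The names `CODON_TABLE` and `translate` are hypothetical.

```python
# A small subset of the standard genetic code, written for the coding DNA strand.
# "*" marks a stop codon. Codons outside this subset raise KeyError.
CODON_TABLE = {
    "ATG": "M",  # methionine (start)
    "TTT": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "GAA": "E",  # glutamic acid
    "TGG": "W",  # tryptophan
    "TAA": "*", "TAG": "*", "TGA": "*",
}


def translate(dna: str) -> str:
    """Read the sequence three bases at a time and stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGTTTGGCAAATAA"))
```

Each non-overlapping triplet of bases maps to one amino acid, so a linear DNA sequence determines a linear protein sequence.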

Consequences 

Other advances in molecular biology stemming from the discovery of the DNA double helix eventually led to ways 
to sequence genes. James Watson played an important role in getting government funding for the Human Genome 
Project. The ability to sequence and manipulate DNA is now central to the biotechnology industry and modern 
medicine. The austere beauty of the structure and the practical implications of the DNA double helix combined to 
make "Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid" one of the most prominent 
biology articles of the twentieth century. 




Figure 3. DNA replication. The two base-pair 
complementary chains of the DNA molecule 
allow for replication of the genetic instructions. 

Controversy 

Watson and Crick based their molecular model of the DNA double helix on data that had been collected by 
researchers in several other laboratories. Watson and Crick were the first to put together all of the scattered 
fragments of information that were required to produce a successful molecular model of DNA. 

Much of the data that were used by Crick and Watson came from unpublished work by Maurice Wilkins, Rosalind 
Franklin, A.R. Stokes, and H.R. Wilson at King's College London in the University of London. Key data from 
Wilkins, Stokes, and Wilson, and, separately, from Franklin and Gosling, were published in two additional 
articles in the same issue of Nature as the article by Watson and Crick. The article by Watson and Crick did 
acknowledge that they had been "stimulated" by experimental results from the King's College researchers, and a 
similar acknowledgment was published by M.H.F. Wilkins, A.R. Stokes, and H.R. Wilson in the following 
three-page article. 

In 1968, Watson published a highly controversial autobiographical account of the discovery of the 
double-helical molecular structure of DNA, called The Double Helix, which was not accepted (at least 
publicly) by either Francis Crick or M.H.F. Wilkins. Furthermore, Erwin Chargaff printed a rather 
"unsympathetic review" of James D. Watson's book in the March 29, 1968 issue of Science. In his 
autobiographical book, Watson stated among other things that he and Crick had access to some of Franklin's data 
from a source that she was not aware of, and also that he had seen, without her permission, the B-DNA X-ray 
diffraction pattern obtained by Franklin and Gosling in May 1952 at King's in London. In particular, in late 1952, Dr. 
Franklin had submitted a progress report to the Medical Research Council, which was reviewed by Dr. Max Perutz, 
then at the Cavendish Laboratory of the University of Cambridge, UK. Watson and Crick also worked in the 
MRC-supported Cavendish Laboratory in Cambridge, whereas Drs. Wilkins and Franklin were in the MRC-supported 
laboratory at King's in London. Such MRC reports were not usually widely circulated, but Crick read a copy of Dr. 
Franklin's research summary in early 1953. 

Max Perutz's justification for passing this information to both Crick and Watson was that the report contained 
information which Watson had previously heard in November 1951, when Dr. Franklin talked about her unpublished 
results with Raymond Gosling during a meeting arranged by Dr. M.H.F. Wilkins at King's College, following a 
request from Crick and Watson. This justification does not hold, however, for Crick, who was not present at the 
November 1951 meeting but who was also given access by Max Perutz to Franklin's MRC report data. These data 
prompted Crick and Watson to seek permission from Sir Lawrence Bragg, who was at the time the head of the 
Cavendish Laboratory in Cambridge, to publish in Nature their double-helix molecular model of DNA based on 
Franklin's and also Wilkins's data. Moreover, in November 1951 Watson had acquired, by his own 
admission, little training in X-ray crystallography, and therefore had not fully understood (again by his 
own admission in The Double Helix) what Dr. Franklin was saying about the structural symmetry of the DNA 
molecule. Crick, however, knowing how Bessel functions describe the Fourier transforms, and hence the X-ray 
diffraction patterns, of helical arrangements of atoms, correctly interpreted one of Dr. Franklin's experimental 
findings as indicating that DNA was most likely a double helix with the two polynucleotide chains running in 
opposite directions. Crick was in a unique position to make this interpretation because he had previously worked 
on the X-ray diffraction data for other large molecules that had helical symmetry similar to that of DNA. 
Dr. Franklin, on the other hand, at first rejected the molecular model building approach proposed by Crick and 
Watson, because their first DNA model, presented by Watson to her and Dr. M.H.F. Wilkins in 1952 in London, had 
an obviously incorrect structure with hydrated charged groups on the inside of the model rather than on the 
outside, as explicitly admitted by James D. Watson in The Double Helix. [10] It is therefore questioned whether 
Crick's colleague, Dr. Max Perutz, acted unethically [11] by allowing Crick access to Dr. Franklin's MRC report 
about the crystallographic unit of the B-DNA and A-DNA structures. Dr. Perutz claimed, however, that he had not, 
because this report was not confidential and had been designed as part of an effort to promote contact between 
different MRC research groups. [12] 

See also 

• DNA 

• X-ray scattering 

• Paracrystal model and theory 

• Crystallography 

• Nucleic acid modeling 

• Watson, James D. (1980). The Double Helix: A Personal Account of the Discovery of the Structure of DNA. 
Atheneum. ISBN 0-689-70602-2. (first published in 1968) 

• Wilkins, Maurice (2003). The Third Man of the Double Helix: The Autobiography of Maurice Wilkins. 

External links 

• Annotated copy of the article from San Francisco's Exploratorium [13] 

• Access Excellence Classic Collection article on DNA structure [14] 

• Linus Pauling and the Race for DNA: A Documentary History [15] 

Online versions of the article 

• Online version (Original text) at nature.com [16] 

• National Library of Medicine's PDF copy in the Francis Crick Documents Collection [17] [18] 

• Commemorative HTML version [19] Am J Psychiatry 160:623-624, April 2003. 

Molecular models of DNA 



Molecular models of DNA structures are representations of the molecular geometry and 
topology of Deoxyribonucleic acid (DNA) molecules using one of several means, with the 
aim of simplifying and presenting the essential, physical and chemical, properties of DNA 
molecular structures either in vivo or in vitro. These representations include closely packed 
spheres (CPK models) made of plastic, metal wires for 'skeletal models', graphic 
computations and animations by computers, and artistic renderings. Computer molecular models 
also allow animations and molecular dynamics simulations that are very important for 
understanding how DNA functions in vivo. 

The more advanced, computer-based molecular models of DNA involve molecular dynamics 
simulations as well as quantum mechanical computations of vibro-rotations, delocalized 
molecular orbitals (MOs), electric dipole moments, hydrogen-bonding, and so on. DNA 
molecular dynamics modeling involves simulations of DNA molecular geometry and 
topology changes with time as a result of both intra- and inter- molecular interactions of 
DNA. Whereas molecular models of deoxyribonucleic acid (DNA) molecules such as 
closely packed spheres (CPK models) made of plastic, or metal wires for 'skeletal models', are 
useful representations of static DNA structures, their usefulness is very limited for 
representing complex DNA dynamics. Computer molecular modeling allows both animations and molecular 
dynamics simulations that are very important for understanding how DNA functions in vivo. 

Figure: Spinning DNA generic model. 




History 

From the very early stages of structural studies of DNA by X-ray diffraction and biochemical means, molecular 
models such as the Watson-Crick double-helix model were successfully employed to solve the 'puzzle' of DNA 
structure, and also to find how the latter relates to its key functions in living cells. The first high quality X-ray 
diffraction patterns of A-DNA were reported by Rosalind Franklin and Raymond Gosling in 1953. The first 
calculations of the Fourier transform of an atomic helix were reported one year earlier by Cochran, Crick and Vand, 
and were followed in 1953 by the computation of the Fourier transform of a coiled-coil by Crick. 

Structural information is generated from X-ray diffraction studies of oriented DNA fibers with the help of molecular 
models of DNA that are combined with crystallographic and mathematical analysis of the X-ray patterns. 
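
The mathematical analysis of helical X-ray patterns rests on Bessel functions: for an idealized continuous helix, the diffracted amplitude on the n-th layer line is proportional to the Bessel function J_n of a radial argument, so the intensity varies as J_n squared. The following Python sketch evaluates J_n from Bessel's integral and locates the first intensity peak on the first few layer lines. It is an illustration only: the function names are hypothetical, and the continuous-helix idealization omits the discrete atoms, the second strand and the real geometry of DNA.

```python
import math


def bessel_j(n: int, x: float, steps: int = 400) -> float:
    """J_n(x) from Bessel's integral (1/pi) * int_0^pi cos(n*t - x*sin(t)) dt.

    The integral is evaluated with the trapezoid rule on `steps` intervals.
    """
    h = math.pi / steps
    total = 0.5 * (1.0 + math.cos(n * math.pi))  # endpoint terms at t = 0 and t = pi
    for k in range(1, steps):
        t = k * h
        total += math.cos(n * t - x * math.sin(t))
    return total * h / math.pi


def first_intensity_peak(n: int, x_max: float = 10.0, dx: float = 0.01) -> float:
    """Smallest x >= 0 at which J_n(x)**2 reaches a local maximum (grid search)."""
    previous = bessel_j(n, 0.0) ** 2
    x = dx
    while x <= x_max:
        current = bessel_j(n, x) ** 2
        if current < previous:
            return x - dx
        previous = current
        x += dx
    raise ValueError("no maximum found below x_max")


if __name__ == "__main__":
    for layer_line in range(4):
        print(layer_line, round(first_intensity_peak(layer_line), 2))
```

The first peak moves outward as the layer-line index grows, which is the origin of the characteristic cross-shaped pattern produced by helical molecules.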

The first reports of a double-helix molecular model of B-DNA structure were made by Watson and Crick in 1953. 
Finally, Maurice H. F. Wilkins, A. Stokes and H.R. Wilson reported the first X-ray patterns of in vivo 
B-DNA in partially oriented salmon sperm heads. The development of the first correct double-helix molecular 
model of DNA by Crick and Watson may not have been possible without the biochemical evidence for the 
nucleotide base-pairing ([A-T]; [C-G]), or Chargaff's rules. Although such initial studies of 
DNA structures with the help of molecular models were essentially static, their consequences for explaining the in 
vivo functions of DNA were significant in the areas of protein biosynthesis and the quasi-universality of the genetic 
code. Epigenetic transformation studies of DNA in vivo were, however, much slower to develop in spite of their 
importance for embryology, morphogenesis and cancer research. Such chemical dynamics and biochemical reactions 
of DNA are much more complex than the molecular dynamics of DNA physical interactions with water, ions and 
proteins/enzymes in living cells. 



Importance 

A long-standing dynamic problem is how DNA "self-replication" takes place in living cells, which should involve 
transient uncoiling of supercoiled DNA fibers. Although DNA consists of relatively rigid, very large elongated 
biopolymer molecules called "fibers" or chains (that are made of repeating nucleotide units of four basic types, 
attached to deoxyribose and phosphate groups), its molecular structure in vivo undergoes dynamic configuration 
changes that involve dynamically attached water molecules and ions. Supercoiling, packing with histones in 
chromosome structures, and other such supramolecular aspects also involve in vivo DNA topology which is even 
more complex than DNA molecular geometry, thus turning molecular modeling of DNA into an especially 
challenging problem for both molecular biologists and biotechnologists. Like other large molecules and biopolymers, 
DNA often exists in multiple stable geometries (that is, it exhibits conformational isomerism) and configurational, 
quantum states which are close to each other in energy on the potential energy surface of the DNA molecule. 

Such varying molecular geometries can also be computed, at least in 
principle, by employing ab initio quantum chemistry methods that can 
attain high accuracy for small molecules, although claims that 
acceptable accuracy can be also achieved for polynucleotides, as well 
as DNA conformations, were recently made on the basis of VCD 
spectral data. Such quantum geometries define an important class of ab 
initio molecular models of DNA whose exploration has barely started 
especially in connection with results obtained by VCD in solutions. 
More detailed comparisons with such ab initio quantum computations 
are in principle obtainable through 2D-FT NMR spectroscopy and 
relaxation studies of polynucleotide solutions or specifically labeled DNA, as for example with deuterium labels. 




DNA computing biochip:3D 



In an interesting twist of roles, the DNA molecule itself was proposed to be utilized for quantum computing. Both 
DNA nanostructures as well as DNA 'computing' biochips have been built (see biochip image at left). 



Examples of DNA molecular models 

Animated molecular models allow one to visually explore the three-dimensional (3D) structure of DNA. One 
type of DNA model is the space-filling, or CPK, model. Another is the wire, or skeletal, type. 

The hydrogen bonding dynamics and proton exchange differ by many orders of magnitude between the 
two systems of fully hydrated DNA and water molecules in ice. Thus, the DNA dynamics is complex, involving 
nanosecond and several tens of picosecond time scales, whereas that of liquid water is on the picosecond time scale, 
and that of proton exchange in ice is on the millisecond time scale; the proton exchange rates in DNA and attached 
proteins may vary from picoseconds to nanoseconds, minutes or years, depending on the exact locations of the 
exchanged protons in the large biopolymers. 

The chemical structure of DNA (Fig. 1) is insufficient to understand the complexity of the 3D structures of DNA. On 
the other hand, animated molecular models allow one to visually explore the three-dimensional (3D) structure of 
DNA. The DNA model shown in Fig. 2 of the following gallery is a space-filling, or CPK, model of the DNA 
double-helix, whereas the third is an animated wire, or skeletal type, molecular model of DNA (Fig. 3). The last two 
DNA molecular models in this series depict quadruplex DNA that may be involved in certain cancers [15]. 

A simple harmonic oscillator 'vibration' (Fig. 4) is only an oversimplified dynamic representation of the longitudinal 
vibrations of the DNA intertwined helices, which were found to be anharmonic rather than harmonic as often 
assumed in quantum dynamic simulations of DNA. The next figure in this gallery is a molecular model of hydrogen 
bonds between water molecules in ice that are similar to those found in DNA (Fig. 5). Figure 6 shows the X-ray 
patterns of A- and B-DNA configurations that inspired the 3D double helix molecular model of DNA. Figure 7 
shows an animated 3D model of a four-way DNA junction. The last figure presents a hypothetical G-quadruplex 
DNA structure that may be involved in causing cancers. 
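
The difference between a harmonic and an anharmonic vibration mentioned above can be made concrete with a short Python sketch. It compares a harmonic potential with a Morse potential, a standard textbook anharmonic model that is used here purely as an illustrative stand-in and is not taken from the DNA studies discussed in this article. The parameter values are arbitrary, and the force constant is chosen so that both curves agree near the equilibrium position.

```python
import math


def harmonic(x: float, k: float) -> float:
    """Harmonic potential energy 0.5 * k * x**2 for displacement x."""
    return 0.5 * k * x * x


def morse(x: float, depth: float, a: float) -> float:
    """Morse potential depth * (1 - exp(-a * x))**2 for displacement x."""
    return depth * (1.0 - math.exp(-a * x)) ** 2


if __name__ == "__main__":
    depth, a = 1.0, 1.0       # arbitrary illustrative parameters
    k = 2.0 * depth * a * a   # matches the curvature of the Morse well at x = 0
    for x in (-0.5, -0.001, 0.001, 0.5):
        print(x, round(harmonic(x, k), 6), round(morse(x, depth, a), 6))
```

Near equilibrium the two potentials are almost identical, but for larger displacements the Morse curve is softer on stretching and stiffer on compression, an asymmetry that a harmonic model cannot reproduce.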

Gallery 1. 

DNA Spacefilling molecular model 



DNA structure determination using molecular modeling and X-ray patterns of DNA 

After DNA has been separated and purified by standard biochemical techniques, one has a sample in a jar, as shown 
in Fig. 1 of Gallery 2. Figure 2 in Gallery 2 specifies the main steps involved in generating structural information 
from X-ray diffraction studies of oriented DNA fibers that are drawn from the hydrated DNA sample (Fig. 1) with the 
help of molecular models of DNA that are combined with crystallographic and mathematical analysis of the X-ray 
patterns. Figure 9 is an actual electron micrograph of a DNA fiber bundle, presumably of a single bacterial 
chromosome loop. 

Gallery 2: Illustration of the molecular modeling and X-ray data collection steps involved in the 
determination of DNA molecular structures 






Paracrystalline lattice models of B-DNA structures 

A paracrystalline lattice, or paracrystal, is a molecular or atomic lattice with significant amounts (e.g., larger than a 
few percent) of partial disordering of molecular arrangements. Limiting cases of the paracrystal model are 
nanostructures, such as glasses, liquids, etc., that may possess only local ordering and no global order. A simple 
example of a paracrystalline lattice is shown in the following figure for a silica glass: 




Liquid crystals also have paracrystalline rather than crystalline structures. 

Highly hydrated B-DNA occurs naturally in living cells in such a paracrystalline state, which is a dynamic one in 
spite of the relatively rigid DNA double-helix stabilized by parallel hydrogen bonds between the nucleotide 
base-pairs in the two complementary, helical DNA chains (see figures). For simplicity most DNA molecular models 
omit both water and ions dynamically bound to B-DNA, and are thus less useful for understanding the dynamic 
behaviors of B-DNA in vivo. The physical and mathematical analysis of X-ray [17] [18] and spectroscopic data for 
paracrystalline B-DNA is therefore much more complicated than that of crystalline A-DNA X-ray diffraction 
patterns. The paracrystal model is also important for DNA technological applications such as DNA nanotechnology. 
Novel techniques that combine X-ray diffraction of DNA with X-ray microscopy in hydrated living cells are now 
also being developed (see, for example, "Application of X-ray microscopy in the analysis of living hydrated cells" 
[19]). 


Genomic and biotechnology applications of DNA molecular modeling 

There are various uses of DNA molecular modeling in Genomics and Biotechnology research applications, from 
DNA repair to PCR and DNA nanostructures. Two-dimensional DNA junction arrays have been visualized by 
atomic force microscopy. [20] 

The following Gallery 3 consists of images that illustrate various uses of DNA molecular modeling in Genomics and 
Biotechnology research applications from DNA repair to PCR and DNA nanostructures; each slide contains its own 
explanation and/or details. The first slide presents an overview of DNA applications, including DNA molecular 
models, with emphasis on Genomics and Biotechnology. 

• The first row: Fig. 2 shows a computer molecular model of RNA polymerase, followed in Figure 3 by that of an E. 
coli bacterial DNA primase template, suggesting very complex dynamics at the interfaces between the enzymes 
and the DNA template; Figures 4 and 5 then illustrate in computed molecular models the mutagenic, chemical 
interaction of potent carcinogen molecules with DNA. 

• The last row figures in Gallery 3 present a DNA biochip and DNA nanostructures designed for DNA computing 
and other dynamic applications of DNA nanotechnology; last image in this row is of self-assembled DNA 
nanostructures. The DNA "tile" structure in this image consists of four branched junctions oriented at 90° angles. 
Each tile consists of nine DNA oligonucleotides as shown; such tiles serve as the primary "building block" for the 
assembly of the DNA nanogrids shown in the AFM micrograph. 

Gallery 3: DNA molecular modeling applications in Genomics and Biotechnology 

Quadruplex DNA may be involved in certain cancers; see also Figures 8 to 10 above in Gallery 1. 

See also 

DNA biochemistry 

• DNA 

• DNA structure 

• G-quadruplex 

X-ray and neutron crystallography 

• Crystallography 

• Crystal lattices 

• X-ray scattering 

• Sir Lawrence Bragg, FRS 

• List of nucleic acid simulation software 

• Neutron scattering 

• X-ray microscopy 

• Sirius visualization software 

• QMC@Home 



Spectroscopy 

• 2D-FT NMRI and Spectroscopy 

• FT-NMR [23] [24] 

• NMR Atlas-database [25] 

• mmcif downloadable coordinate files of nucleic acids in solution from 2D-FT NMR data [26] 

• NMR constraints files for NAs in PDB format [27] 

• NMR microscopy [28] 

• Microwave spectroscopy 

• Vibrational circular dichroism (VCD) 

• FT-IR 

• FT-NIR [29] [30] [31] 

• Spectral, Hyperspectral, and Chemical imaging 

• Raman spectroscopy/microscopy and CARS 

• Fluorescence correlation spectroscopy [39], Fluorescence cross-correlation spectroscopy and FRET [45] [46] [47] 

• Confocal microscopy [39] 

External links 

• DNA the Double Helix Game [152], from the official Nobel Prize web site 

• MDDNA: Structural Bioinformatics of DNA [50] 

• Double Helix 1953-2003 [156] National Centre for Biotechnology Education 

• DNAlive: a web interface to compute DNA physical properties. Also allows cross-linking of the results with 
the UCSC Genome browser and DNA dynamics. 

• DiProDB: Dinucleotide Property Database. The database is designed to collect and analyse thermodynamic, 
structural and other dinucleotide properties. 

• Further details of mathematical and molecular analysis of DNA structure based on X-ray data 

• Bessel functions corresponding to Fourier transforms of atomic or molecular helices. [54] 

• Application of X-ray microscopy in analysis of living hydrated cells [19] 

• Overview of STM/AFM/SNOM principles with educative videos [55] 

Databases for DNA molecular models and sequences 
X-ray diffraction 

• NDB ID: UD0017 Database [13] 

• X-ray Atlas -database ^ 

• PDB files of coordinates for nucleic acid structures from X-ray diffraction by NA (incl. DNA) crystals 

• Structure factors dowloadable files in CIF format ^ 

Neutron scattering 

• ISIS pulsed neutron source: A world centre for science with neutrons & muons at Harwell,
near Oxford, UK. [59]

X-ray microscopy 

• Application of X-ray microscopy in the analysis of living hydrated cells [19]

Electron microscopy 

• DNA under electron microscope [153]
Genomic and structural databases 

• CBS Genome Atlas Database: contains examples of base skews.

• The Z curve database of genomes: a 3-dimensional visualization and analysis tool of genomes [61]

• DNA and other nucleic acids' molecular models: Coordinate files of nucleic acids molecular structure models in 
PDB and CIF formats [62] 



Atomic force microscopy 

• How SPM Works [63] 

• SPM Image Gallery - AFM STM SEM MFM NSOM and more. [64] 

DNA replication



DNA replication, the basis for biological inheritance, is a
fundamental process occurring in all living organisms to copy their
DNA. This process is "replication" in that each strand of the original
double-stranded DNA molecule serves as template for the
reproduction of the complementary strand. Hence, following DNA
replication, two identical DNA molecules have been produced from a
single double-stranded DNA molecule. Cellular proofreading and
error-checking mechanisms ensure near perfect fidelity for DNA
replication. [1] [2]



In a cell, DNA replication begins at specific locations in the genome,
called "origins". [3] Unwinding of DNA at the origin, and synthesis of
new strands, forms a replication fork. In addition to DNA polymerase, 
the enzyme that synthesizes the new DNA by adding nucleotides 
matched to the template strand, a number of other proteins are 
associated with the fork and assist in the initiation and continuation of 
DNA synthesis. 

DNA replication can also be performed in vitro (outside a cell). DNA 
polymerases, isolated from cells, and artificial DNA primers are used 
to initiate DNA synthesis at known sequences in a template molecule. 
The polymerase chain reaction (PCR), a common laboratory 
technique, employs such artificial synthesis in a cyclic manner to 
amplify a specific target DNA fragment from a pool of DNA. 




DNA replication. The double helix is unwound 
and each strand acts as a template. Bases are 
matched to synthesize the new partner strands. 

DNA structure



DNA usually exists as a double-stranded structure, with both
strands coiled together to form the characteristic double helix.
Each single strand of DNA is a chain of four types of nucleotides:
adenine, cytosine, guanine, and thymine. A nucleotide is a mono-,
di- or triphosphate deoxyribonucleoside; that is, a deoxyribose
sugar is attached to one, two or three phosphates. Chemical
interaction of these nucleotides forms phosphodiester linkages,
creating the phosphate-deoxyribose backbone of the DNA double
helix with the bases pointing inward. Nucleotides (bases) are
matched between strands through hydrogen bonds to form base
pairs. Adenine pairs with thymine and cytosine pairs with guanine.



The chemical structure of DNA: adenine pairs with thymine and cytosine with guanine along the
phosphate-deoxyribose backbone, with the 3' and 5' ends of each strand labelled.



DNA strands have a directionality, and the different ends of a
single strand are called the "3' (three-prime) end" and the "5'
(five-prime) end." These terms refer to the carbon atom in
deoxyribose to which the next phosphate in the chain attaches. In 
addition to being complementary, the two strands of DNA are 
antiparallel: they are oriented in opposite directions. This directionality has consequences in DNA synthesis, because 
DNA polymerase can only synthesize DNA in one direction by adding nucleotides to the 3' end of a DNA strand. 

The pairing of bases in DNA through hydrogen bonding means that the information contained within each strand is 
redundant. The nucleotides on a single strand can be used to reconstruct nucleotides on a newly synthesized partner 
strand.
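
The redundancy described above can be shown in a few lines. This is a minimal sketch, assuming strands are written as strings of the letters A, C, G and T in the 5' to 3' direction; the function name `partner_strand` and the example sequence are illustrative and do not come from the article.

```python
# Watson-Crick pairing rules stated in the text: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def partner_strand(strand: str) -> str:
    """Return the partner of `strand`, both written 5' to 3'.

    The two strands are antiparallel, so the complemented bases are
    emitted in reverse order to keep the 5' to 3' writing convention.
    """
    return "".join(PAIRS[base] for base in reversed(strand.upper()))


template = "ATGCGTTA"  # hypothetical example sequence
partner = partner_strand(template)
print(partner)

# Applying the pairing rule twice recovers the original strand,
# which is the redundancy the text refers to.
print(partner_strand(partner) == template)
```

Because each strand fully determines the other, either one can serve as the template when the other is lost or damaged.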

DNA polymerase



DNA polymerases are a family of enzymes that carry out all forms of
DNA replication. [5] A DNA polymerase can only extend an existing
DNA strand paired with a template strand; it cannot begin the 
synthesis of a new strand. To begin synthesis of a new strand, a short 
fragment of DNA or RNA, called a primer, must be created and paired 
with the template strand before DNA polymerase can synthesize new 
DNA. 

Once a primer pairs with DNA to be replicated, DNA polymerase 
synthesizes a new strand of DNA by extending the 3' end of an 
existing nucleotide chain, adding new nucleotides matched to the 
template strand one at a time via the creation of phosphodiester bonds. 
The energy for this process of DNA polymerization comes from two 
of the three total phosphates attached to each unincorporated base. 
(Free bases with their attached phosphate groups are called nucleoside 
triphosphates.) When a nucleotide is being added to a growing DNA 
strand, two of the phosphates are removed and the energy produced 
creates a phosphodiester (chemical) bond that attaches the remaining 
phosphate to the growing chain. The energetics of this process also 
help explain the directionality of synthesis - if DNA were synthesized 
in the 3' to 5' direction, the energy for the process would come from 
the 5' end of the growing strand rather than from free nucleotides. 

DNA polymerases are generally extremely accurate, making less than
one error for every 10^7 nucleotides added. Even so, some DNA
polymerases also have proofreading ability; they can remove 
nucleotides from the end of a strand in order to correct mismatched 
bases. If the 5' nucleotide needs to be removed during proofreading, 
the triphosphate end is lost. Hence, the energy source that usually 
provides energy to add a new nucleotide is also lost. 
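
To put the accuracy figure in perspective, the following sketch converts the stated upper bound of one error per 10^7 nucleotides into a bound on errors per copied genome. The genome sizes are approximate values used as assumptions for illustration, and the function name is hypothetical.

```python
# Upper bound from the text: fewer than one error per 10^7 nucleotides added.
NUCLEOTIDES_PER_ERROR = 10**7


def max_expected_errors(nucleotides_copied: int) -> float:
    """Upper bound on polymerase errors when copying this many nucleotides."""
    return nucleotides_copied / NUCLEOTIDES_PER_ERROR


# Approximate sizes, assumed here for illustration only.
genomes = {
    "E. coli chromosome (about 4.6 million base pairs)": 4_600_000,
    "human diploid genome (about 6 billion bases)": 6_000_000_000,
}

for name, size in genomes.items():
    print(f"{name}: fewer than {max_expected_errors(size):g} errors per copy")
```

The bound applies to the polymerase alone; proofreading and later repair pathways lower the final error count further.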








DNA polymerase adds nucleotides to the 3' end of 
a strand of DNA. If a mismatch is accidentally 
incorporated, the polymerase is inhibited from 
further extension. Proofreading removes the 
mismatched nucleotide and extension continues. 



DNA replication within the cell 

Origins of replication 

For a cell to divide, it must first replicate its DNA. [7] This process is initiated at particular points within the DNA,
known as "origins", which are targeted by proteins that separate the two strands and initiate DNA synthesis.
Origins contain DNA sequences recognized by replication initiator proteins (e.g. DnaA in E. coli and the Origin
Recognition Complex in yeast). [8] These initiator proteins recruit other proteins to separate the two strands and
initiate replication forks. 

Initiator proteins recruit other proteins to separate the DNA strands at the origin, forming a bubble. Origins tend to
be "AT-rich" (rich in adenine and thymine bases) to assist this process, because A-T base pairs have two hydrogen
bonds (rather than the three formed in a C-G pair); strands rich in these nucleotides are generally easier to separate
because fewer hydrogen bonds have to be broken.
Once strands are separated, RNA primers are created on the template strands. More specifically, the leading strand
receives one RNA primer per active origin of replication while the lagging strand receives several; the short
stretches of DNA extended from these primers on the lagging strand are called Okazaki fragments, named after their
discoverer. DNA polymerase extends the leading strand in one continuous motion and the lagging strand in a
discontinuous motion (due to the Okazaki fragments). RNase removes the RNA fragments used to initiate replication
by DNA polymerase, and another DNA polymerase enters to fill the gaps. When this is complete, a single nick on
the leading strand and several nicks on the lagging strand can be found. Ligase works to seal these nicks, thus
completing the newly replicated DNA molecule.
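
The hydrogen-bond argument for AT-rich origins can be made concrete by counting bonds. This is a minimal sketch that uses only the two-versus-three bond rule given above; the sequences and function names are illustrative assumptions, and the count ignores other contributions to duplex stability.

```python
# Hydrogen bonds per base pair, as stated in the text:
# A-T pairs form two, C-G pairs form three.
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def hydrogen_bonds(strand: str) -> int:
    """Total hydrogen bonds holding `strand` to its complementary partner."""
    return sum(H_BONDS[base] for base in strand.upper())


def at_fraction(strand: str) -> float:
    """Fraction of bases in `strand` that are adenine or thymine."""
    strand = strand.upper()
    return sum(base in "AT" for base in strand) / len(strand)


# Hypothetical 8-base stretches, chosen only to contrast the two extremes.
for label, seq in (("AT-rich", "ATTATAAT"), ("GC-rich", "GCCGCGGC")):
    print(label, seq, f"AT={at_fraction(seq):.0%}", f"bonds={hydrogen_bonds(seq)}")
```

For equal length, the AT-rich stretch is held together by a third fewer hydrogen bonds than the GC-rich one.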

As DNA synthesis continues, the original DNA strands continue to unwind on each side of the bubble, forming two
replication forks. In bacteria, which have a single origin of replication on their circular chromosome, this process
eventually creates a "theta structure" (resembling the Greek letter theta: θ). In contrast, eukaryotes have longer linear
chromosomes and initiate replication at multiple origins within these. 

The replication fork 

When replicating, the original DNA splits in 
two, forming two "prongs" which resemble 
a fork (hence the name "replication fork"). 
DNA has a ladder-like structure; imagine a 
ladder broken in half vertically, along the 
steps. Each half of the ladder now requires a 
new half to match it. Because DNA 
polymerase can only synthesize a new DNA 
strand in a 5' to 3' manner, the process of 
replication goes differently for the two 
strands comprising the DNA double helix. 

Leading strand 

The leading strand is that strand of the DNA double helix that is oriented in a 5' to 3' manner. 

On the leading strand, a polymerase "reads" the DNA and adds nucleotides to it continuously. This polymerase is 
DNA polymerase III (DNA Pol III) in prokaryotes and presumably Pol ε in eukaryotes.

Lagging strand 

The lagging strand is that strand of the DNA double helix that is oriented in a 3' to 5' manner. Because of its
orientation, opposite to the working orientation of DNA polymerase III which is in a 5' to 3' manner, replication of
the lagging strand is more complicated than that of the leading strand.

On the lagging strand, primase "reads" the DNA and adds RNA to it in short, separated segments. In eukaryotes,
primase is intrinsic to Pol α. DNA polymerase III or Pol δ lengthens the primed segments, forming Okazaki
fragments. Primer removal in eukaryotes is also performed by Pol δ. [13] In prokaryotes, DNA polymerase I "reads"
the fragments, removes the RNA using its flap endonuclease domain, and replaces the RNA nucleotides with DNA
nucleotides (this is necessary because RNA and DNA use slightly different kinds of nucleotides). DNA ligase joins
the fragments together.
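
The discontinuous synthesis described above can be sketched as follows. This is a simplified model under stated assumptions: the lagging-strand template is written 5' to 3' with the fork exposing it from left to right, every fragment has the same hypothetical length, and RNA primers and their removal are left out. All names are illustrative.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Complementary strand of `seq`, with both written 5' to 3'."""
    return "".join(PAIRS[base] for base in reversed(seq))


def okazaki_fragments(lagging_template: str, fragment_len: int) -> list:
    """Fragments in order of synthesis as the fork exposes more template.

    Each exposed stretch is copied 5' to 3', which means the polymerase
    travels away from the fork while reading that stretch.
    """
    return [
        reverse_complement(lagging_template[i:i + fragment_len])
        for i in range(0, len(lagging_template), fragment_len)
    ]


def ligate(fragments: list) -> str:
    """Join the fragments into one strand written 5' to 3'.

    The most recently made fragment lies nearest the fork, at the 5' end
    of the new strand, so the fragments are joined in reverse order.
    """
    return "".join(reversed(fragments))


template = "ATGCGTTAGC"  # hypothetical lagging-strand template
fragments = okazaki_fragments(template, fragment_len=4)
print(fragments)
print(ligate(fragments) == reverse_complement(template))
```

The joined fragments give the same strand that continuous synthesis would have produced; only the order of construction differs.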

Dynamics at the replication fork

As helicase unwinds DNA at the replication fork, the DNA ahead
is forced to rotate. This process results in a build-up of twists in
the DNA ahead. [14] This build-up would form a resistance that
would eventually halt the progress of the replication fork. DNA
topoisomerases are enzymes that solve these physical problems in
the coiling of DNA. Topoisomerase I cuts a single backbone on
the DNA, enabling the strands to swivel around each other to
remove the build-up of twists. Topoisomerase II cuts both
backbones, enabling one double-stranded DNA to pass through
another, thereby removing knots and entanglements that can form
within and between DNA molecules.

Bare single-stranded DNA has a tendency to fold back upon itself 
and form secondary structures; these structures can interfere with 
the movement of DNA polymerase. To prevent this, single-strand
binding proteins bind to the DNA until a second strand is
synthesized, preventing secondary structure formation.




The assembled human DNA clamp, a trimer of the 
protein PCNA. 



Clamp proteins form a sliding clamp around DNA, helping the DNA polymerase maintain contact with its template 
and thereby assisting with processivity. The inner face of the clamp enables DNA to be threaded through it. Once the 
polymerase reaches the end of the template or detects double-stranded DNA, the sliding clamp undergoes a
conformational change which releases the DNA polymerase. Clamp-loading proteins are used to initially load the 
clamp, recognizing the junction between template and RNA primers. 




Regulation of replication 

Eukaryotes 

Within eukaryotes, DNA replication is controlled within the context of the
cell cycle. As the cell grows and divides, it progresses through stages in the
cell cycle; DNA replication occurs during the S phase (Synthesis phase). The
progress of the eukaryotic cell through the cycle is controlled by cell cycle
checkpoints. Progression through checkpoints is controlled through complex
interactions between various proteins, including cyclins and cyclin-dependent
kinases.

The G1/S checkpoint (or restriction checkpoint) regulates whether eukaryotic
cells enter the process of DNA replication and subsequent division. Cells
which do not proceed through this checkpoint are quiescent in the "G0" stage
and do not replicate their DNA.

Replication of chloroplast and mitochondrial genomes occurs independent of the cell cycle, through the process of 
D-loop replication. 

Bacteria 

Most bacteria do not go through a well-defined cell cycle and instead continuously copy their DNA; during rapid
growth this can result in multiple rounds of replication occurring concurrently. Within E. coli, the most
well-characterized bacterium, regulation of DNA replication can be achieved through several mechanisms, including:
the hemimethylation and sequestering of the origin sequence, the ratio of ATP to ADP, and the levels of protein
DnaA. These all control the process of initiator proteins binding to the origin sequences.



The cell cycle of eukaryotic cells. 

Because E. coli methylates GATC DNA sequences, DNA synthesis results in hemimethylated sequences. This
hemimethylated DNA is recognized by a protein (SeqA) which binds and sequesters the origin sequence; in addition,
DnaA (required for initiation of replication) binds less well to hemimethylated DNA. As a result, newly replicated
origins are prevented from immediately initiating another round of DNA replication. [18]
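
The hemimethylated state can be illustrated with a short sketch. It assumes a sequence written as a string, 0-based positions, and a simple two-flag record per GATC site; these representations and all names are illustrative assumptions rather than a model taken from the article.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}
SITE = "GATC"


def reverse_complement(seq: str) -> str:
    """Complementary strand of `seq`, with both written 5' to 3'."""
    return "".join(PAIRS[base] for base in reversed(seq))


def gatc_sites(seq: str) -> list:
    """0-based start positions of every GATC site in `seq`."""
    return [i for i in range(len(seq) - len(SITE) + 1) if seq[i:i + len(SITE)] == SITE]


def state_after_replication(parental_strand: str) -> dict:
    """Methylation state of each GATC site just after it has been copied.

    The parental strand keeps its methyl group, while the newly made
    strand has not been methylated yet.
    """
    return {pos: {"parental": True, "new": False} for pos in gatc_sites(parental_strand)}


def is_hemimethylated(state: dict) -> bool:
    """True when exactly one of the two strands carries the methyl group."""
    return state["parental"] != state["new"]


origin_like = "TTGATCAAGATCGG"  # hypothetical sequence
print(reverse_complement(SITE) == SITE)  # GATC reads the same on both strands
for pos, state in state_after_replication(origin_like).items():
    print(pos, "hemimethylated" if is_hemimethylated(state) else "fully methylated")
```

Because GATC is its own reverse complement, every site exists on both strands at the same position, so a freshly copied site is always methylated on one strand only until the new strand is modified.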

ATP builds up when the cell is in a rich medium, triggering DNA replication once the cell has reached a specific
size. ATP competes with ADP to bind to DnaA, and the DnaA-ATP complex is able to initiate replication. A certain
number of DnaA proteins are also required for DNA replication; each time the origin is copied the number of
binding sites for DnaA doubles, requiring the synthesis of more DnaA to enable another initiation of replication.

Termination of replication 

Because bacteria have circular chromosomes, termination of replication occurs when the two replication forks meet
each other on the opposite end of the parental chromosome. E. coli regulates this process through the use of
termination sequences which, when bound by the Tus protein, enable only one direction of replication fork to pass
through. As a result, the replication forks are constrained to always meet within the termination region of the
chromosome.

Eukaryotes initiate DNA replication at multiple points in the chromosome, so replication forks meet and terminate at
many points in the chromosome; these are not known to be regulated in any particular manner. Because eukaryotes
have linear chromosomes, DNA replication often fails to synthesize to the very end of the chromosomes (telomeres),
resulting in telomere shortening. This is a normal process in somatic cells; cells are only able to divide a certain
number of times before the DNA loss prevents further division. (This is known as the Hayflick limit.) Within the
germ cell line, which passes DNA to the next generation, telomerase extends the repetitive sequences of the telomere
region to prevent degradation. Telomerase can become mistakenly active in somatic cells, sometimes leading to
cancer formation.
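
The idea of a division limit set by telomere loss can be sketched as a simple counter. Every number below is a made-up placeholder chosen for easy arithmetic, not a measured value, and the fixed loss per division is a simplifying assumption; the function name is hypothetical.

```python
def divisions_until_limit(telomere_bp: int, loss_per_division_bp: int, critical_bp: int) -> int:
    """Count divisions possible before the telomere would drop below `critical_bp`.

    Assumes a constant loss per division and no telomerase activity.
    """
    if loss_per_division_bp <= 0:
        raise ValueError("loss per division must be positive")
    divisions = 0
    while telomere_bp - loss_per_division_bp >= critical_bp:
        telomere_bp -= loss_per_division_bp
        divisions += 1
    return divisions


# Placeholder values for illustration only.
print(divisions_until_limit(telomere_bp=10_000, loss_per_division_bp=100, critical_bp=5_000))
```

In this toy model, telomerase would correspond to adding length back after each division, which is why its activity removes the limit.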



Polymerase chain reaction 

Researchers commonly replicate DNA in vitro using the polymerase chain reaction (PCR). PCR uses a pair of 
primers to span a target region in template DNA, and then polymerizes partner strands in each direction from these 
primers using a thermostable DNA polymerase. Repeating this process through multiple cycles produces 
amplification of the targeted DNA region. At the start of each cycle, the mixture of template and primers is heated, 
separating the newly synthesized molecule and template. Then, as the mixture cools, both of these become templates 
for annealing of new primers, and the polymerase extends from these. As a result, the number of copies of the target 
region doubles each round, increasing exponentially. [20]
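
The exponential growth follows directly from the doubling per cycle. A minimal sketch is given below; the `efficiency` parameter is an added assumption that lets the per-cycle gain fall below a perfect doubling, and the function name is illustrative.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Copies of the target region after `cycles` rounds of PCR.

    `efficiency` is the fraction of molecules copied in each cycle:
    1.0 is the ideal doubling described in the text, and lower values
    are a hypothetical way to model an imperfect reaction.
    """
    return initial_copies * (1 + efficiency) ** cycles


for n in (1, 10, 20, 30):
    print(f"{n:>2} cycles: {pcr_copies(1, n):,.0f} copies from a single template")
```

With perfect doubling, a single template molecule yields 2^n copies after n cycles, so 30 cycles give roughly a billion copies.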

DNA repair



DNA repair refers to a collection of processes by which a cell
identifies and corrects damage to the DNA molecules that encode its
genome. In human cells, both normal metabolic activities and
environmental factors such as UV light and radiation can cause DNA
damage, resulting in as many as 1 million individual molecular lesions
per cell per day. [1] Many of these lesions cause structural damage to
the DNA molecule and can alter or eliminate the cell's ability to
transcribe the gene that the affected DNA encodes. Other lesions
induce potentially harmful mutations in the cell's genome, which affect
the survival of its daughter cells after it undergoes mitosis.
Consequently, the DNA repair process is constantly active as it
responds to damage in the DNA structure. When normal repair
processes fail, and when cellular apoptosis does not occur, irreparable
DNA damage may occur, including double-strand breaks and DNA
crosslinkages.

The rate of DNA repair is dependent on many factors, including the 
cell type, the age of the cell, and the extracellular environment. A cell 
that has accumulated a large amount of DNA damage, or one that no 
longer effectively repairs damage incurred to its DNA, can enter one of three possible states: 

1. an irreversible state of dormancy, known as senescence

2. cell suicide, also known as apoptosis or programmed cell death 

3. unregulated cell division, which can lead to the formation of a tumor that is cancerous 

The DNA repair ability of a cell is vital to the integrity of its genome and thus to its normal functioning and that of 
the organism. Many genes that were initially shown to influence life span have turned out to be involved in DNA 
damage repair and protection. Failure to correct molecular lesions in cells that form gametes can introduce
mutations into the genomes of the offspring and thus influence the rate of evolution. 

DNA damage 

DNA damage, due to environmental factors and normal metabolic processes inside the cell, occurs at a rate of 1,000
to 1,000,000 molecular lesions per cell per day. [1] While even the upper figure constitutes only about 0.017% of the human genome's
approximately 6 billion bases (3 billion base pairs), unrepaired lesions in critical genes (such as tumor suppressor
genes) can impede a cell's ability to carry out its function and appreciably increase the likelihood of tumor formation.
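
The share of the genome affected each day is the lesion count divided by the genome size. The sketch below works this out for the two ends of the quoted range and one intermediate value, assuming the figure of 6 billion bases given above; the function name is illustrative.

```python
GENOME_BASES = 6_000_000_000  # approximate human genome size quoted in the text


def percent_of_genome(lesions: int) -> float:
    """Lesions expressed as a percentage of all bases in the genome."""
    return 100 * lesions / GENOME_BASES


for lesions in (1_000, 10_000, 1_000_000):
    print(f"{lesions:>9,} lesions per day = {percent_of_genome(lesions):.6f}% of bases")
```

Even at the top of the range, fewer than two bases in every ten thousand are affected per day, which is why the location of a lesion matters more than the raw count.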

The vast majority of DNA damage affects the primary structure of the double helix; that is, the bases themselves are 
chemically modified. These modifications can in turn disrupt the molecules' regular helical structure by introducing 
non-native chemical bonds or bulky adducts that do not fit in the standard double helix. Unlike proteins and RNA, 
DNA usually lacks tertiary structure and therefore damage or disturbance does not occur at that level. DNA is, 
however, supercoiled and wound around "packaging" proteins called histones (in eukaryotes), and both 
superstructures are vulnerable to the effects of DNA damage. 

Sources of damage

DNA damage can be subdivided into two main types: 

1. endogenous damage such as attack by reactive oxygen species produced from normal metabolic byproducts
(spontaneous mutation), especially the process of oxidative deamination;

   1. also includes replication errors

2. exogenous damage caused by external agents such as

   1. ultraviolet [UV 200-300 nm] radiation from the sun

   2. other radiation frequencies, including x-rays and gamma rays

   3. hydrolysis or thermal disruption

   4. certain plant toxins

   5. human-made mutagenic chemicals, especially aromatic compounds that act as DNA intercalating agents

   6. cancer chemotherapy and radiotherapy

   7. viruses [5]

The replication of damaged DNA before cell division can lead to the incorporation of wrong bases opposite damaged 
ones. Daughter cells that inherit these wrong bases carry mutations from which the original DNA sequence is 
unrecoverable (except in the rare case of a back mutation, for example, through gene conversion). 

Types of damage 

There are five main types of damage to DNA due to endogenous cellular processes: 

1. oxidation of bases [e.g. 8-oxo-7,8-dihydroguanine (8-oxoG)] and generation of DNA strand interruptions from 
reactive oxygen species, 

2. alkylation of bases (usually methylation), such as formation of 7-methylguanine, 1-methyladenine, 
6-O-Methylguanine 

3. hydrolysis of bases, such as deamination, depurination and depyrimidination. 

4. "bulky adduct formation" (e.g. benzo[a]pyrene diol epoxide-dG adduct)

5. mismatch of bases, due to errors in DNA replication, in which the wrong DNA base is stitched into place in a 
newly forming DNA strand, or a DNA base is skipped over or mistakenly inserted. 

Damage caused by exogenous agents comes in many forms. Some examples are: 

1. UV-B light causes crosslinking between adjacent cytosine and thymine bases creating pyrimidine dimers. This is 
called direct DNA damage. 

2. UV-A light creates mostly free radicals. The damage caused by free radicals is called indirect DNA damage. 

3. Ionizing radiation such as that created by radioactive decay or in cosmic rays causes breaks in DNA strands. 

4. Thermal disruption at elevated temperature increases the rate of depurination (loss of purine bases from the DNA
backbone) and single strand breaks. For example, hydrolytic depurination is seen in the thermophilic bacteria,
which grow in hot springs at 40-80 °C. The rate of depurination (300 purine residues per genome per
generation) is too high in these species to be repaired by normal repair machinery, hence a possibility of an
adaptive response cannot be ruled out.

5. Industrial chemicals such as vinyl chloride and hydrogen peroxide, and environmental chemicals such as
polycyclic hydrocarbons found in smoke, soot and tar create a huge diversity of DNA adducts: ethenobases,
oxidized bases, alkylated phosphotriesters and crosslinking of DNA, to name a few.

UV damage, alkylation/methylation, X-ray damage and oxidative damage are examples of induced damage. 

Spontaneous damage can include the loss of a base, deamination, sugar ring puckering and tautomeric shift. 

Nuclear versus mitochondrial DNA damage

In human cells, and eukaryotic cells in general, DNA is found in two cellular locations - inside the nucleus and 
inside the mitochondria. Nuclear DNA (nDNA) exists as chromatin during non-replicative stages of the cell cycle 
and is condensed into aggregate structures known as chromosomes during cell division. In either state the DNA is 
highly compacted and wound up around bead-like proteins called histones. Whenever a cell needs to express the 
genetic information encoded in its nDNA the required chromosomal region is unravelled, genes located therein are 
expressed, and then the region is condensed back to its resting conformation. Mitochondrial DNA (mtDNA) is 
located inside mitochondria organelles, exists in multiple copies, and is also tightly associated with a number of 
proteins to form a complex known as the nucleoid. Inside mitochondria, reactive oxygen species (ROS), or free 
radicals, byproducts of the constant production of adenosine triphosphate (ATP) via oxidative phosphorylation, 
create a highly oxidative environment that is known to damage mtDNA. A critical enzyme in counteracting the 
toxicity of these species is superoxide dismutase, which is present in both the mitochondria and cytoplasm of 
eukaryotic cells. 

Senescence and apoptosis 

Senescence, an irreversible state in which the cell no longer divides, is a protective response to the shortening of the
chromosome ends. The telomeres are long regions of repetitive noncoding DNA that cap chromosomes and undergo
partial degradation each time a cell undergoes division (see Hayflick limit). In contrast, quiescence is a reversible
state of cellular dormancy that is unrelated to genome damage (see cell cycle). Senescence in cells may serve as a
functional alternative to apoptosis in cases where the physical presence of a cell for spatial reasons is required by the
organism, which serves as a "last resort" mechanism to prevent a cell with damaged DNA from replicating
inappropriately in the absence of pro-growth cellular signaling. Unregulated cell division can lead to the formation of
a tumor (see cancer), which is potentially lethal to an organism. Therefore the induction of senescence and apoptosis
is considered to be part of a strategy of protection against cancer.

DNA damage and mutation 

It is important to distinguish between DNA damage and mutation, the two major types of error in DNA. DNA 
damages and mutation are fundamentally different. Damages are physical abnormalities in the DNA, such as single 
and double strand breaks, 8-hydroxydeoxyguanosine residues and polycyclic aromatic hydrocarbon adducts. DNA
damages can be recognized by enzymes, and thus they can be correctly repaired if redundant information, such as the 
undamaged sequence in the complementary DNA strand or in a homologous chromosome, is available for copying. 
If a cell retains DNA damage, transcription of a gene can be prevented and thus translation into a protein will also be 
blocked. Replication may also be blocked and/or the cell may die. 

In contrast to DNA damage, a mutation is a change in the base sequence of the DNA. A mutation cannot be 
recognized by enzymes once the base change is present in both DNA strands, and thus a mutation cannot be repaired. 
At the cellular level, mutations can cause alterations in protein function and regulation. Mutations are replicated 
when the cell replicates. In a population of cells, mutant cells will increase or decrease in frequency according to the 
effects of the mutation on the ability of the cell to survive and reproduce. Although distinctly different from each 
other, DNA damages and mutations are related because DNA damages often cause errors of DNA synthesis during 
replication or repair and these errors are a major source of mutation. 

Given these properties of DNA damage and mutation, it can be seen that DNA damages are a special problem in 
non-dividing or slowly dividing cells, where unrepaired damages will tend to accumulate over time. On the other 
hand, in rapidly dividing cells, unrepaired DNA damages that do not kill the cell by blocking replication will tend to 
cause replication errors and thus mutation. The great majority of mutations that are not neutral in their effect are 
deleterious to a cell's survival. Thus, in a population of cells comprising a tissue with replicating cells, mutant cells 
will tend to be lost. However, infrequent mutations that provide a survival advantage will tend to clonally expand at
the expense of neighboring cells in the tissue. This advantage to the cell is disadvantageous to the whole organism,
because such mutant cells can give rise to cancer. Thus DNA damages in frequently dividing cells, because they give
rise to mutations, are a prominent cause of cancer. In contrast, DNA damages in infrequently dividing cells are likely
a prominent cause of aging.



DNA repair mechanisms 




Cells cannot function if DNA damage corrupts the integrity and accessibility of 
essential information in the genome (but cells remain superficially functional when 
so-called "non-essential" genes are missing or damaged). Depending on the type of 
damage inflicted on the DNA's double helical structure, a variety of repair 
strategies have evolved to restore lost information. If possible, cells use the 
unmodified complementary strand of the DNA or the sister chromatid as a template 
to recover the original information. Without access to a template, cells use an 
error-prone recovery mechanism known as translesion synthesis as a last resort. 

Damage to DNA alters the spatial configuration of the helix and such alterations 
can be detected by the cell. Once damage is localized, specific DNA repair 
molecules bind at or near the site of damage, inducing other molecules to bind and 
form a complex that enables the actual repair to take place. The types of molecules 
involved and the mechanism of repair that is mobilized depend on the type of 
damage that has occurred and the phase of the cell cycle that the cell is in. 



Direct reversal 

Cells are known to eliminate three types of damage to their DNA by chemically 
reversing it. These mechanisms do not require a template, since the types of 
damage they counteract can only occur in one of the four bases. Such direct 
reversal mechanisms are specific to the type of damage incurred and do not involve 
breakage of the phosphodiester backbone. The formation of pyrimidine dimers 
upon irradiation with UV light results in an abnormal covalent bond between 
adjacent pyrimidine bases. The photoreactivation process directly reverses this 
damage by the action of the enzyme photolyase, whose activation is obligately 
dependent on energy absorbed from blue/UV light (300-500 nm wavelength) to promote catalysis. Another type
of damage, methylation of guanine bases, is directly reversed by the protein methylguanine methyltransferase
(MGMT), the bacterial equivalent of which is called ogt. This is an expensive process because each MGMT
molecule can only be used once; that is, the reaction is stoichiometric rather than catalytic. [14] A generalized
response to methylating agents in bacteria is known as the adaptive response and confers a level of resistance to
alkylating agents upon sustained exposure by upregulation of alkylation repair enzymes. [15] The third type of DNA
damage reversed by cells is certain methylation of the bases cytosine and adenine.




Single strand and double strand 
DNA damage 

Single strand damage




When only one of the two strands of a double helix has 
a defect, the other strand can be used as a template to 
guide the correction of the damaged strand. In order to 
repair damage to one of the two paired molecules of 
DNA, there exist a number of excision repair 
mechanisms that remove the damaged nucleotide and 
replace it with an undamaged nucleotide 
complementary to that found in the undamaged DNA 
strand. [14]



Structure of the base-excision repair enzyme uracil-DNA 
glycosylase. The uracil residue is shown in yellow. 



1. Base excision repair (BER), which repairs damage to a single base caused by oxidation, alkylation,
hydrolysis, or deamination. The damaged base is removed by a DNA glycosylase. The "missing tooth" is then
recognised by an enzyme called AP endonuclease, which cuts the phosphodiester bond. The missing part is then
resynthesized by a DNA polymerase, and a DNA ligase performs the final nick-sealing step.

2. Nucleotide excision repair (NER), which recognizes bulky, helix-distorting lesions such as pyrimidine dimers and 
6,4 photoproducts. A specialized form of NER known as transcription-coupled repair deploys NER enzymes to 
genes that are being actively transcribed. 

3. Mismatch repair (MMR), which corrects errors of DNA replication and recombination that result in mispaired 
(but undamaged) nucleotides. 
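
The template principle shared by these excision pathways can be sketched in a few lines of Python. This is a toy
model only: the placeholder 'X' for a damaged base and the function name repair_strand are assumptions made for
illustration, and lesion recognition, excision, resynthesis and ligation are not modelled as separate steps.

```python
# Watson-Crick pairing used to rebuild a damaged position from the intact strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def repair_strand(damaged, template):
    """Replace every 'X' in `damaged` with the complement of the aligned template base.

    Both strings are assumed to be aligned position by position, i.e. the
    template is written 3'->5' underneath the 5'->3' damaged strand.
    """
    if len(damaged) != len(template):
        raise ValueError("strands must have equal length")
    repaired = []
    for base, partner in zip(damaged, template):
        # Undamaged bases are kept; damaged ones are read off the template.
        repaired.append(COMPLEMENT[partner] if base == "X" else base)
    return "".join(repaired)


print(repair_strand("ATGXCGTA", "TACGGCAT"))  # ATGCCGTA
```

The sketch works only because the second strand is intact, which is why damage to both strands at the same
position calls for the different mechanisms described next.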

Double-strand breaks 

A double-strand break (DSB) occurs in one of the paired DNAs, followed by enzymatic trimming back of
nucleotides on the new single-strand ends.

A free 3' end invades the unbroken helix and displaces a loop of single strand DNA. A DNA polymerase elongates 
the free 3' end of the invading strand, further displacing the looped out strand, which then pairs with an exposed 
single-strand on the opposing helix. 

The displaced strand serves as a template for enzymatic extension from the 3' end of the paired single strand, which 
eventually crosses the junction and switches templates. As 3' and 5' ends meet, strands join to form two Holliday 
junctions. 

There are two ways to resolve each Holliday junction by single cleavage and rejoining, so there are four ways to 
resolve the double Holliday structure by two cleavages and rejoinings. 

Two of these combinations of cleavage and rejoining generate non-crossover recombinants. The other two 
combinations of cleavage and rejoining generate crossover recombinants. All recombinants include heteroduplex 
segments. 

Double-strand breaks, in which both strands in the double helix are severed, are particularly hazardous to the cell
because they can lead to genome rearrangements. Three mechanisms exist to repair DSBs: non-homologous end
joining (NHEJ), microhomology-mediated end joining (MMEJ) and homologous recombination. In NHEJ, DNA
Ligase IV, a specialized DNA ligase that forms a complex with the cofactor XRCC4, directly joins the two ends.
To guide accurate repair, NHEJ relies on short homologous sequences called microhomologies present on the
single-stranded tails of the DNA ends to be joined. If these overhangs are compatible, repair is usually
accurate.[17] NHEJ can also introduce mutations during repair. Loss of damaged nucleotides at the break site can
lead to deletions, and joining of nonmatching termini forms translocations. NHEJ is especially important before
the cell has replicated its DNA, since there is no template available for repair by homologous recombination.
There are "backup" NHEJ pathways in higher eukaryotes. Besides its role as a genome caretaker, NHEJ is required
for joining hairpin-capped double-strand breaks induced during V(D)J recombination, the process that generates
diversity in B-cell and T-cell receptors in the vertebrate immune system.[22]




DNA ligase, shown above repairing chromosomal damage, is 
an enzyme that joins broken nucleotides together by catalyzing 
the formation of an internucleotide ester bond between the 
phosphate backbone and the deoxyribose nucleotides. 



Homologous recombination requires the presence of an 
identical or nearly identical sequence to be used as a 
template for repair of the break. The enzymatic machinery 
responsible for this repair process is nearly identical to the
machinery responsible for chromosomal crossover during meiosis. This pathway allows a damaged chromosome to
be repaired using a sister chromatid (available in G2 after DNA replication) or a homologous chromosome as a 
template. DSBs caused by the replication machinery attempting to synthesize across a single-strand break or 
unrepaired lesion cause collapse of the replication fork and are typically repaired by recombination. 

Topoisomerases introduce both single- and double-strand breaks in the course of changing the DNA's state of
supercoiling, which is especially common in regions near an open replication fork. Such breaks are not considered 
DNA damage because they are a natural intermediate in the topoisomerase biochemical mechanism and are 
immediately repaired by the enzymes that created them. 

A team of French researchers bombarded Deinococcus radiodurans to study the mechanism of double-strand break
DNA repair in that organism. At least two copies of the genome, with random DNA breaks, can form DNA
fragments through annealing. Partially overlapping fragments are then used for synthesis of homologous regions
through a moving D-loop that can continue extension until they find complementary partner strands. In the final step
there is crossover by means of RecA-dependent homologous recombination.



Translesion synthesis 

Translesion synthesis is a DNA damage tolerance process that allows the DNA replication machinery to replicate 
past DNA lesions such as thymine dimers or AP sites.[24] It involves switching out regular DNA polymerases for
specialized translesion polymerases (e.g. DNA polymerase V), often with larger active sites that can facilitate the 
insertion of bases opposite damaged nucleotides. The polymerase switching is thought to be mediated by, among 
other factors, the post-translational modification of the replication processivity factor PCNA. Translesion synthesis 
polymerases often have low fidelity (high propensity to insert wrong bases) relative to regular polymerases. 
However, many are extremely efficient at inserting correct bases opposite specific types of damage. For example, 
Pol η mediates error-free bypass of lesions induced by UV irradiation, whereas Pol ζ introduces mutations at these
sites. From a cellular perspective, risking the introduction of point mutations during translesion synthesis may be
preferable to resorting to more drastic mechanisms of DNA repair, which may cause gross chromosomal aberrations
or cell death. 

Global response to DNA damage 

Cells exposed to ionizing radiation, ultraviolet light or chemicals are prone to acquire multiple sites of bulky DNA 
lesions and double strand breaks. Moreover, DNA damaging agents can damage other biomolecules such as proteins, 
carbohydrates, lipids and RNA. The accumulation of damage, specifically double strand breaks or adducts stalling
the replication forks, is among the known stimulation signals for a global response to DNA damage. The global
response to damage is an act directed toward the cells' own preservation and triggers multiple pathways of 
macromolecular repair, lesion bypass, tolerance or apoptosis. The common features of global response are induction 
of multiple genes, cell cycle arrest, and inhibition of cell division. 

DNA damage checkpoints 

After DNA damage, cell cycle checkpoints are activated. Checkpoint activation pauses the cell cycle and gives the 
cell time to repair the damage before continuing to divide. DNA damage checkpoints occur at the G1/S and G2/M
boundaries. An intra-S checkpoint also exists. Checkpoint activation is controlled by two master kinases, ATM and
ATR. ATM responds to DNA double-strand breaks and disruptions in chromatin structure, whereas ATR
primarily responds to stalled replication forks. These kinases phosphorylate downstream targets in a signal 
transduction cascade, eventually leading to cell cycle arrest. A class of checkpoint mediator proteins including 
BRCA1, MDC1, and 53BP1 has also been identified. These proteins seem to be required for transmitting the 
checkpoint activation signal to downstream proteins. 

p53 is an important downstream target of ATM and ATR, as it is required for inducing apoptosis following DNA
damage. At the G1/S checkpoint, p53 functions by deactivating the CDK2/cyclin E complex. Similarly, p21
mediates the G2/M checkpoint by deactivating the CDK1/cyclin B complex.

The prokaryotic SOS response 

The SOS response is the term used to describe changes in gene expression in Escherichia coli and other bacteria in 
response to extensive DNA damage. The prokaryotic SOS system is regulated by two key proteins: LexA and RecA. 
The LexA homodimer is a transcriptional repressor that binds to operator sequences commonly referred to as SOS 
boxes. In Escherichia coli it is known that LexA regulates transcription of approximately 48 genes including the 
lexA and recA genes. The SOS response is known to be widespread in the Bacteria domain, but it is mostly
absent in some bacterial phyla, like the Spirochetes. The most common cellular signals activating the SOS
response are regions of single stranded DNA (ssDNA), arising from stalled replication forks or double strand breaks, 
which are processed by DNA helicase to separate the two DNA strands. In the initiation step, RecA protein binds 
to ssDNA in an ATP hydrolysis driven reaction creating RecA-ssDNA filaments. RecA-ssDNA filaments activate 
LexA autoprotease activity which ultimately leads to cleavage of LexA dimer and subsequent LexA degradation. 
The loss of LexA repressor induces transcription of the SOS genes and allows for further signal induction, inhibition 
of cell division and an increase in levels of proteins responsible for damage processing. 

In Escherichia coli, SOS boxes are 20-nucleotide long sequences near promoters with palindromic structure and a 
high degree of sequence conservation. In other classes and phyla, the sequence of SOS boxes varies considerably, 
with different length and composition, but it is always highly conserved and one of the strongest short signals in the 
genome. The high information content of SOS boxes permits differential binding of LexA to different promoters
and allows for timing of the SOS response. Logically, the lesion repair genes are induced at the beginning of the SOS
response. The error-prone translesion polymerases, for example UmuCD'2 (also called DNA polymerase V), are
induced later on as a last resort. Once the DNA damage is repaired or bypassed using polymerases or through
recombination, the amount of single-stranded DNA in cells decreases. The resulting drop in RecA filaments
decreases the cleavage activity of the LexA homodimer, which subsequently binds to the SOS boxes near promoters and
restores normal gene expression.
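
The "palindromic structure" of an SOS box means that the site reads the same on both strands, that is, it equals
its own reverse complement. A minimal Python sketch of a scan for such sites follows. The 20-nucleotide example
sequence is an idealized box assumed here for illustration and is not taken from the text above; real LexA sites
are imperfect palindromes and are normally found with weight matrices rather than exact matching. The function
names are hypothetical.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def palindromic_sites(genome, length=20):
    """Yield (position, site) for every window equal to its own reverse complement."""
    for start in range(len(genome) - length + 1):
        site = genome[start:start + length]
        if site == reverse_complement(site):
            yield start, site


example_box = "TACTGTATATATATACAGTA"        # assumed idealized SOS box
toy_sequence = "CCCCC" + example_box + "CCCCC"
print(list(palindromic_sites(toy_sequence)))  # [(5, 'TACTGTATATATATACAGTA')]
```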

Eukaryotic transcriptional responses to DNA damage 

Eukaryotic cells exposed to DNA damaging agents also activate important defensive pathways by inducing multiple
proteins involved in DNA repair, cell cycle checkpoint control, protein trafficking and degradation. Such a
genome-wide transcriptional response is very complex and tightly regulated, thus allowing a coordinated global
response to damage. Exposure of the yeast Saccharomyces cerevisiae to DNA damaging agents results in overlapping
but distinct transcriptional profiles. Similarities to the environmental shock response indicate that a general
global stress response pathway exists at the level of transcriptional activation. In contrast, different human cell
types respond to damage differently, indicating an absence of a common global response. The probable explanation
for this difference between yeast and human cells may be in the heterogeneity of mammalian cells. In an animal,
different types of cells are distributed amongst different organs which have evolved different sensitivities to DNA
damage.

In general global response to DNA damage involves expression of multiple genes responsible for postreplication 
repair, homologous recombination, nucleotide excision repair, DNA damage checkpoint, global transcriptional 
activation, genes controlling mRNA decay and many others. A large amount of damage to a cell leaves it with an 
important decision: undergo apoptosis and die, or survive at the cost of living with a modified genome. An increase 
in tolerance to damage can lead to an increased rate of survival which will allow a greater accumulation of 
mutations. Yeast Rev1 and human polymerase η are members of Y family translesion DNA polymerases present
during global response to DNA damage and are responsible for enhanced mutagenesis during a global response to
DNA damage in eukaryotes.

Pathological effects of poor DNA repair

Experimental animals with genetic deficiencies in DNA repair often 
show decreased life span and increased cancer incidence.[12] For
example, mice deficient in the dominant NHEJ pathway and in 
telomere maintenance mechanisms get lymphoma and infections more 
often, and consequently have shorter life spans than wild-type mice. 
Similarly, mice deficient in a key repair and transcription protein that 
unwinds DNA helices have premature onset of aging-related diseases 
and consequent shortening of life span. However, not every DNA 
repair deficiency creates exactly the predicted effects; mice deficient in 
the NER pathway exhibited shortened life span without 
correspondingly higher rates of mutation.

cell pathology



If the rate of DNA damage exceeds the capacity of the cell to repair it, the accumulation of errors can overwhelm the
cell and result in early senescence, apoptosis or cancer. Inherited diseases associated with faulty DNA repair
functioning result in premature aging, increased sensitivity to carcinogens, and correspondingly increased cancer
risk (see below). On the other hand, organisms with enhanced DNA repair systems, such as Deinococcus
radiodurans, the most radiation-resistant known organism, exhibit remarkable resistance to the double strand
break-inducing effects of radioactivity, likely due to enhanced efficiency of DNA repair and especially NHEJ.

Longevity and caloric restriction



A number of individual genes have been identified as influencing 
variations in life span within a population of organisms. The effects of
these genes are strongly dependent on the environment, particularly on
the organism's diet. Caloric restriction reproducibly results in extended 
life span in a variety of organisms, likely via nutrient sensing pathways 
and decreased metabolic rate. The molecular mechanisms by which 
such restriction results in lengthened life span are as yet unclear 
(see [36] for some discussion); however, the behavior of many genes 
known to be involved in DNA repair is altered under conditions of 
caloric restriction.
For example, increasing the gene dosage of the gene SIR-2, which
regulates DNA packaging in the nematode worm Caenorhabditis
elegans, can significantly extend life span. The mammalian
homolog of SIR-2 is known to induce downstream DNA repair factors
involved in NHEJ, an activity that is especially promoted under
conditions of caloric restriction. Caloric restriction has been closely
linked to the rate of base excision repair in the nuclear DNA of
rodents, although similar effects have not been observed in mitochondrial DNA.[40]



Most life span influencing genes affect the rate of 
DNA damage 



Interestingly, the C. elegans gene AGE-1, an upstream effector of DNA repair pathways, confers dramatically 
extended life span under free-feeding conditions but leads to a decrease in reproductive fitness under conditions of 
caloric restriction.[41] This observation supports the pleiotropy theory of the biological origins of aging, which
suggests that genes conferring a large survival advantage early in life will be selected for even if they carry a 
corresponding disadvantage late in life. 



Medicine and DNA repair modulation 
Hereditary DNA repair disorders 

Defects in the NER mechanism are responsible for several genetic disorders, including: 

• xeroderma pigmentosum: hypersensitivity to sunlight/UV, resulting in increased skin cancer incidence and 
premature aging 

• Cockayne syndrome: hypersensitivity to UV and chemical agents 

• trichothiodystrophy: sensitive skin, brittle hair and nails 

Mental retardation often accompanies the latter two disorders, suggesting increased vulnerability of developmental 
neurons. 

Other DNA repair disorders include: 

• Werner's syndrome: premature aging and retarded growth 

• Bloom's syndrome: sunlight hypersensitivity, high incidence of malignancies (especially leukemias). 

• ataxia telangiectasia: sensitivity to ionizing radiation and some chemical agents 

All of the above diseases are often called "segmental progerias" ("accelerated aging diseases") because their victims 
appear elderly and suffer from aging-related diseases at an abnormally young age, while not manifesting all the 
symptoms of old age. 

Other diseases associated with reduced DNA repair function include Fanconi's anemia, hereditary breast cancer and 
hereditary colon cancer.

DNA repair and cancer

Inherited mutations that affect DNA repair genes are strongly associated with high cancer risks in humans. 
Hereditary nonpolyposis colorectal cancer (HNPCC) is strongly associated with specific mutations in the DNA 
mismatch repair pathway. BRCA1 and BRCA2, two well-known genes whose mutations confer a greatly increased risk of
breast cancer on carriers, are both associated with a large number of DNA repair pathways, especially NHEJ and
homologous recombination.

Cancer therapy procedures such as chemotherapy and radiotherapy work by overwhelming the capacity of the cell to 
repair DNA damage, resulting in cell death. Cells that are most rapidly dividing - most typically cancer cells - are 
preferentially affected. The side effect is that other non-cancerous but rapidly dividing cells such as stem cells in the 
bone marrow are also affected. Modern cancer treatments attempt to localize the DNA damage to cells and tissues 
only associated with cancer, either by physical means (concentrating the therapeutic agent in the region of the tumor) 
or by biochemical means (exploiting a feature unique to cancer cells in the body). 

DNA repair and evolution 

The basic processes of DNA repair are highly conserved among both prokaryotes and eukaryotes and even among 
bacteriophage (viruses that infect bacteria); however, more complex organisms with more complex genomes have 
correspondingly more complex repair mechanisms.[42] The ability of a large number of protein structural motifs to
catalyze relevant chemical reactions has played a significant role in the elaboration of repair mechanisms during
evolution. For an extremely detailed review of hypotheses relating to the evolution of DNA repair, see [43].

The fossil record indicates that single celled life began to proliferate on the planet at some point during the 
Precambrian period, although exactly when recognizably modern life first emerged is unclear. Nucleic acids became 
the sole and universal means of encoding genetic information, requiring DNA repair mechanisms that in their basic 
form have been inherited by all extant life forms from their common ancestor. The emergence of Earth's oxygen-rich 
atmosphere (known as the "oxygen catastrophe") due to photosynthetic organisms, as well as the presence of 
potentially damaging free radicals in the cell due to oxidative phosphorylation, necessitated the evolution of DNA 
repair mechanisms that act specifically to counter the types of damage induced by oxidative stress. 

Rate of evolutionary change 

On some occasions, DNA damage is not repaired, or is repaired by an error-prone mechanism which results in a 
change from the original sequence. When this occurs, mutations may propagate into the genomes of the cell's 
progeny. Should such an event occur in a germ line cell that will eventually produce a gamete, the mutation has the 
potential to be passed on to the organism's offspring. The rate of evolution in a particular species (or, more narrowly, 
in a particular gene) is a function of the rate of mutation. Consequently, the rate and accuracy of DNA repair 
mechanisms have an influence over the process of evolutionary change.[44]

See also 

• Direct DNA damage 

• Indirect DNA damage 

• DNA damage theory of aging 

• Accelerated aging disease 

• Aging DNA 

• Cell cycle 

• DNA replication 

• Gene therapy 

• Life extension

• Human mitochondrial genetics

• Progeria 

• Senescence 

• The scientific journal DNA Repair under Mutation Research 



External links 

• Roswell Park Cancer Institute DNA Repair Lectures
• DNA Repair - A summary of the primary mechanisms
• A comprehensive list of Human DNA Repair Genes
• 3D structures of some DNA repair enzymes [48]
• Human DNA repair diseases
• DNA repair special interest group [50]
• DNA Repair [51]
• DNA Damage and DNA Repair [52]
• Segmental Progeria [53]
• DNA-damage repair; the good, the bad, and the ugly [54]

DNA Translation



Translation is the first stage of protein biosynthesis (part of the overall process of gene expression). In translation, 
messenger RNA (mRNA) produced in transcription is decoded to produce a specific amino acid chain, or
polypeptide, that will later fold into an active protein. Translation occurs in the cell's cytoplasm, where the large and
small subunits of the ribosome are located and bind to the mRNA. The ribosome facilitates decoding by inducing
the binding of tRNAs with anticodon sequences complementary to those of the mRNA. The tRNAs carry specific
amino acids that are chained together into a polypeptide as the mRNA passes through and is "read" by the ribosome
in a fashion reminiscent of a stock ticker and ticker tape.

In many instances, the entire ribosome/mRNA complex will bind to the outer membrane of the rough endoplasmic 
reticulum and release the nascent protein polypeptide inside for later vesicle transport and secretion outside of the 
cell. Many types of transcribed RNA, such as transfer RNA, ribosomal RNA, and small nuclear RNA, do not 
undergo translation into proteins. 

Translation proceeds in four phases: activation, initiation, elongation and termination (all describing the growth of 
the amino acid chain, or polypeptide that is the product of translation). Amino acids are brought to ribosomes and 
assembled into proteins. 

In activation, the correct amino acid is covalently bonded to the correct transfer RNA (tRNA). The amino acid is 
joined by its carboxyl group to the 3' OH of the tRNA by an ester bond. When the tRNA has an amino acid linked
to it, it is termed "charged". Initiation involves the small subunit of the ribosome binding to the 5' end of mRNA with the
help of initiation factors (IF). Termination of the polypeptide happens when the A site of the ribosome faces a stop 
codon (UAA, UAG, or UGA). No tRNA can recognize or bind to this codon. Instead, the stop codon induces the 
binding of a release factor protein that prompts the disassembly of the entire ribosome/mRNA complex. 

A number of antibiotics act by inhibiting translation; these include anisomycin, cycloheximide, chloramphenicol, 
tetracycline, streptomycin, erythromycin, and puromycin, among others. Prokaryotic ribosomes have a different 
structure from that of eukaryotic ribosomes, and thus antibiotics can specifically target bacterial infections without 
any detriment to a eukaryotic host's cells. 

Basic mechanisms 

See articles at prokaryotic translation and eukaryotic translation 

The mRNA carries genetic information 
encoded as a ribonucleotide sequence from 
the chromosomes to the ribosomes. The 
ribonucleotides are "read" by translational 
machinery in a sequence of nucleotide 
triplets called codons. Each of those triplets 
codes for a specific amino acid. 

The ribosome molecules translate this code 
to a specific sequence of amino acids. The 
ribosome is a multisubunit structure 
containing rRNA and proteins. It is the 
"factory" where amino acids are assembled 
into proteins. tRNAs are small noncoding 
RNA chains (74-93 nucleotides) that 
transport amino acids to the ribosome. 
tRNAs have a site for amino acid 
attachment, and a site called an anticodon. The anticodon is an RNA triplet complementary to the mRNA triplet that 
codes for their cargo amino acid. 

Aminoacyl tRNA synthetase (an enzyme) catalyzes the bonding between specific tRNAs and the amino acids that
their anticodon sequences call for. The product of this reaction is an aminoacyl-tRNA molecule. This
aminoacyl-tRNA travels inside the ribosome, where mRNA codons are matched through complementary base
pairing to specific tRNA anticodons. The amino acids that the tRNAs carry are then used to assemble a protein. The
energy required for translation of proteins is significant. For a protein containing n amino acids, the number of
high-energy phosphate bonds required to translate it is 4n+1. The rate of translation varies; it is significantly higher
in prokaryotic cells (up to 17-21 amino acid residues per second) than in eukaryotic cells (up to 6-9 amino acid
residues per second).

Diagram showing the translation of mRNA and the synthesis of proteins by a ribosome



Genetic code 

Whereas other aspects such as the 3D structure, called tertiary structure, of protein can only be predicted using 
sophisticated algorithms, the amino acid sequence, called primary structure, can be determined solely from the 
nucleic acid sequence with the aid of a translation table. 

This approach may not give the correct amino acid composition of the protein, in particular if unconventional amino 
acids such as selenocysteine are incorporated into the protein, which is coded for by a conventional stop codon in 
combination with a downstream hairpin (SElenoCysteine Insertion Sequence, or SECIS). 

There are many computer programs capable of translating a DNA/RNA sequence into a protein sequence. Normally 
this is performed using the Standard Genetic Code; many bioinformaticians have written at least one such program at 
some point in their education. However, few programs can handle all the "special" cases, such as the use of the 
alternative initiation codons. For example, the rare alternative start codon CTG codes for Methionine when used as a 
start codon, and for Leucine in all other positions. 

Example: Condensed translation table for the Standard Genetic Code (from the NCBI Taxonomy webpage).

AAs    = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M---------------M---------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
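
The condensed table can be turned into a working translator with very little code. The following Python sketch
rebuilds the codon-to-amino-acid mapping from the AAs row and applies it in reading frame 1. The Base1/Base2/Base3
rows are generated rather than typed, since they simply enumerate all codons with the bases in T, C, A, G order.
The Starts row is written with '-' placeholders, which is an assumption about the layout, and the names translate
and first_is_start are hypothetical.

```python
from itertools import product

# AAs row of the condensed table for the Standard Code, split into blocks of 16.
AAS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
       "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Starts row with '-' placeholders (assumed layout): M marks TTG, CTG and ATG.
STARTS = "---M" + "-" * 15 + "M" + "-" * 15 + "M" + "-" * 28

# Equivalent to reading Base1, Base2 and Base3 column by column.
CODONS = ["".join(triplet) for triplet in product("TCAG", repeat=3)]
CODON_TABLE = dict(zip(CODONS, AAS))
START_CODONS = {codon for codon, flag in zip(CODONS, STARTS) if flag == "M"}


def translate(dna, first_is_start=True):
    """Translate DNA in reading frame 1; an incomplete trailing codon is ignored."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    protein = [CODON_TABLE[codon] for codon in codons]
    # An alternative start codon such as CTG is read as methionine in first position.
    if first_is_start and codons and codons[0] in START_CODONS:
        protein[0] = "M"
    return "".join(protein)


print(translate("ATGGTGCTGTCTGCCGCCGACAAGGGCAATGTCAAG"))  # MVLSAADKGNVK
print(translate("CTGGTGTAA"))                             # MV*
print(translate("CTGGTGTAA", first_is_start=False))       # LV*
```

The last two calls show the special case described above: CTG yields methionine as a start codon and leucine
elsewhere. Cases such as selenocysteine insertion are deliberately not handled.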

Translation tables 

Even when working with ordinary eukaryotic sequences such as the yeast genome, it is often desirable to be able to
use alternative translation tables, namely for translation of the mitochondrial genes. Currently the following
translation tables are defined by the NCBI Taxonomy Group for the translation of the sequences in GenBank:



1: The Standard

2: The Vertebrate Mitochondrial Code

3: The Yeast Mitochondrial Code

4: The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code

5: The Invertebrate Mitochondrial Code

6: The Ciliate, Dasycladacean and Hexamita Nuclear Code

9: The Echinoderm and Flatworm Mitochondrial Code

10: The Euplotid Nuclear Code

11: The Bacterial and Plant Plastid Code

12: The Alternative Yeast Nuclear Code

13: The Ascidian Mitochondrial Code

14: The Alternative Flatworm Mitochondrial Code

15: Blepharisma Nuclear Code

16: Chlorophycean Mitochondrial Code

21: Trematode Mitochondrial Code

22: Scenedesmus obliquus mitochondrial Code

23: Thraustochytrium Mitochondrial Code
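
Most alternative tables differ from the standard one in only a handful of codons, so a program can store them as
small sets of overrides. The Python sketch below illustrates this for table 2. The four reassignments listed for
the vertebrate mitochondrial code are not given in the text above; they are stated here as an assumption and
should be checked against the NCBI tables before use. The dictionary and function names are hypothetical.

```python
from itertools import product

# Standard Code (table 1), codons enumerated with bases in T, C, A, G order.
STANDARD_AAS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = ["".join(triplet) for triplet in product("TCAG", repeat=3)]
STANDARD = dict(zip(CODONS, STANDARD_AAS))

# Assumed differences of table 2 (vertebrate mitochondrial) from table 1.
VERTEBRATE_MITO = {**STANDARD, "AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"}

TABLES = {1: STANDARD, 2: VERTEBRATE_MITO}


def translate(dna, table_id=1):
    """Translate DNA in reading frame 1 using the table with the given number."""
    table = TABLES[table_id]
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # drop an incomplete trailing codon
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATATGAAGA", table_id=1))  # I*R
print(translate("ATATGAAGA", table_id=2))  # MW*
```

The same nine nucleotides give an internal stop codon under the standard code and a readable peptide under the
mitochondrial code, which is why choosing the right table matters for mitochondrial genes.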

Software examples 

• ApE [6] (Mac, Windows, Unix)

• Serial Cloner (A DNA editing and manipulating software for MacOS and Windows)

• DNA Strider (Mac)

• ExPASy Translate Tool [8] (webserver)

• Virtual Ribosome (webserver, cross-platform command-line)

• DNA to protein translation (webserver, 13 genomic codes or custom ones)

Example of computational translation - notice the indication of (alternative) start-codons: 

VIRTUAL RIBOSOME

Translation table: Standard SGC0

>Seq1
Reading frame: 1

   M  V  L  S  A  A  D  K  G  N  V  K  A  A  W  G  K  V  G  G  H  A  A  E  Y  G  A  E  A  L
5' ATGGTGCTGTCTGCCGCCGACAAGGGCAATGTCAAGGCCGCCTGGGGCAAGGTTGGCGGCCACGCTGCAGAGTATGGCGCAGAGGCCCT 90
   >>>...)))..............................................................................))

   E  R  M  F  L  S  F  P  T  T  K  T  Y  F  P  H  F  D  L  S  H  G  S  A  Q  V  K  G  H  G
5' GAGAGGATGTTCCTGAGCTTCCCCACCACCAAGACCTACTTCCCCCACTTCGACCTGAGCCACGGCTCCGCGCAGGTCAAGGGCCACGG 180
   ......>>>...))).......................................)))................................

   A  K  V  A  A  A  L  T  K  A  V  E  H  L  D  D  L  P  G  A  L  S  E  L  S  D  L  H  A  H
5' GCGAAGGTGGCCGCCGCGCTGACCAAAGCGGTGGAACACCTGGACGACCTGCCCGGTGCCCTGTCTGAACTGAGTGACCTGCACGCTCA 270
   ..................)))..................)))......))).........)))......)))......)))........

   K  L  R  V  D  P  V  N  F  K  L  L  S  H  S  L  L  V  T  L  A  S  H  L  P  S  D  F  T  P
5' AAGCTGCGTGTGGACCCGGTCAACTTCAAGCTTCTGAGCCACTCCCTGCTGGTGACCCTGGCCTCCCACCTCCCCAGTGATTTCACCCC 360
   ...)))...........................))).........))))))......))).............................

   A  V  H  A  S  L  D  K  F  L  A  N  V  S  T  V  L  T  S  K  Y  R  *
5' GCGGTCCACGCCTCCCTGGACAAGTTCTTGGCCAACGTGAGCACCGTGCTGACCTCCAAATACCGTTAA 429
   ...............))).........)))..................)))...............***

Annotation key:

>>> : START codon (strict)

))) : START codon (alternative)

*** : STOP

See also 

• Expanded genetic code 

• Protein methods

DNA Transcription





Transcription, or RNA synthesis, is the process of creating an equivalent RNA copy of a sequence of DNA.
Both RNA and DNA are nucleic acids, which use base pairs of nucleotides as a complementary language that can be
converted back and forth from DNA to RNA in the presence of the correct enzymes. During transcription, a DNA
sequence is read by RNA polymerase, which produces a complementary, antiparallel RNA strand. As opposed to 
DNA replication, transcription results in an RNA complement that includes uracil (U) in all instances where thymine 
(T) would have occurred in a DNA complement. 

Transcription is the first step leading to gene expression. The stretch of DNA transcribed into an RNA molecule is 
called a transcription unit and encodes at least one gene. If the gene transcribed encodes a protein, the result of 
transcription is messenger RNA (mRNA), which will then be used to create that protein via the process of 
translation. Alternatively, the transcribed gene may encode either ribosomal RNA (rRNA) or transfer RNA 
(tRNA), which are other components of the protein-assembly process, or other non-coding RNAs such as ribozymes. 

A DNA transcription unit encoding a protein contains not only the sequence that will eventually be directly 
translated into the protein (the coding sequence) but also regulatory sequences that direct and regulate the synthesis 
of that protein. The regulatory sequence before (upstream from) the coding sequence is called the five prime 
untranslated region (5'UTR), and the sequence following (downstream from) the coding sequence is called the three 
prime untranslated region (3'UTR). 
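The layout described above (5'UTR, coding sequence, 3'UTR) can be illustrated on a transcript. This is only a sketch under simplifying assumptions that are not from the text: translation is taken to start at the first AUG and to end at the first in-frame stop codon, and the function name `split_mrna` is hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_mrna(mrna):
    """Split an mRNA (written 5'->3') into (5'UTR, coding sequence, 3'UTR).

    Simplification: the coding sequence is assumed to run from the first AUG
    to the first in-frame stop codon, inclusive.
    """
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        raise ValueError("no AUG start codon found")
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3
            return mrna[:start], mrna[start:end], mrna[end:]
    raise ValueError("no in-frame stop codon found")


utr5, cds, utr3 = split_mrna("GGAGCCAUGGCUUAACCUUU")
print(utr5, cds, utr3)  # GGAGCC AUGGCUUAA CCUUU
```

Real transcripts can use alternative start codons and contain upstream AUGs, so this rule is a teaching device rather than a gene-finding method.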

Transcription has some proofreading mechanisms, but they are fewer and less effective than the controls for copying 
DNA; therefore, transcription has a lower copying fidelity than DNA replication. 

As in DNA replication, the DNA template is read in the 3' → 5' direction during transcription, while the complementary RNA is 
synthesized in the 5' → 3' direction. Although DNA is arranged as two antiparallel strands in a double helix, only one 
of the two DNA strands, called the template strand, is used for transcription. This is because RNA is 
single-stranded, as opposed to double-stranded DNA. The other DNA strand is called the coding strand, because its 
sequence is the same as the newly created RNA transcript (except for the substitution of uracil for thymine). The use 
of only the 3' → 5' strand eliminates the need for the Okazaki fragments seen in DNA replication. 
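The relationship between the template strand, the coding strand and the transcript can be checked with a short sketch. The function names are illustrative; the template is written 3' to 5' so that the RNA can be read off left to right in the 5' to 3' direction.

```python
# Base-pairing rules used when RNA is synthesized on a DNA template.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}


def template_from_coding(coding_5to3):
    """Return the template strand, written 3'->5', for a coding strand written 5'->3'."""
    return "".join(DNA_TO_DNA[b] for b in coding_5to3.upper())


def transcribe(template_3to5):
    """Return the RNA (5'->3') made by pairing each template base with its RNA partner."""
    return "".join(DNA_TO_RNA[b] for b in template_3to5.upper())


coding = "ATGGCGTAA"
template = template_from_coding(coding)   # TACCGCATT (3'->5')
rna = transcribe(template)                # AUGGCGUAA (5'->3')

# The transcript equals the coding strand with uracil in place of thymine.
print(rna == coding.replace("T", "U"))    # True
```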

Transcription is divided into five stages: pre-initiation, initiation, promoter clearance, elongation and termination. 



Major steps 
Pre-initiation 

In eukaryotes, RNA polymerase, and therefore the initiation of transcription, requires the presence of a core 
promoter sequence in the DNA. Promoters are regions of DNA that promote transcription and, in eukaryotes, are 
found at positions -30, -75 and -90 (that is, 30, 75 and 90 base pairs upstream of the start site of transcription). 
Core promoters are sequences within the promoter that are essential for transcription initiation. RNA polymerase 
is able to bind to core promoters in the presence of various specific transcription factors. 

The most common type of core promoter in eukaryotes is a short DNA sequence known as a TATA box, found about 30 
base pairs upstream of the start site of transcription. The TATA box, as a core promoter, is the binding site for a 
transcription factor known as TATA binding protein (TBP), which is itself a subunit of another transcription factor, 
called Transcription Factor II D (TFIID). After TFIID binds to the TATA box via the TBP, five more transcription 
factors and RNA polymerase combine around the TATA box in a series of stages to form a preinitiation complex. 
One transcription factor, TFIIH, has helicase activity and so is involved in separating the opposing strands 
of double-stranded DNA to provide access to a single-stranded DNA template. However, only a low, or basal, rate of 
transcription is driven by the preinitiation complex alone. Other proteins known as activators and repressors, along 
with any associated coactivators or corepressors, are responsible for modulating transcription rate. 

Transcription preinitiation in archaea, which were formerly grouped with bacteria as prokaryotes, is essentially 
homologous to that of eukaryotes, but is much less complex. The archaeal preinitiation complex assembles at a 
TATA-box binding site; however, in archaea, this complex is composed of only RNA polymerase (which resembles 
eukaryotic RNA polymerase II), TBP, and TFB (the archaeal homologue of eukaryotic transcription factor II B (TFIIB)). 
Initiation

Simple diagram of transcription initiation. RNAP = RNA polymerase

In bacteria, a domain of prokaryotes, transcription begins with the binding of RNA polymerase to the promoter in 
DNA. RNA polymerase is a core enzyme consisting of five subunits: 2 α subunits, 1 β subunit, 1 β' subunit, and 
1 ω subunit. At the start of initiation, the core enzyme is associated with a sigma factor (number 70) that aids in 
finding the appropriate promoter sequences, the -35 and -10 elements, located about 35 and 10 base pairs upstream 
of the transcription start site. 
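How a sigma factor might recognise a promoter can be sketched as a search for two short elements separated by a spacer. The consensus hexamers `TTGACA` (-35) and `TATAAT` (-10), the 16 to 19 bp spacer, and the one-mismatch tolerance are assumptions brought in for illustration (they are the commonly quoted E. coli sigma-70 values, not taken from this text), and the function name is hypothetical.

```python
MINUS_35 = "TTGACA"  # assumed -35 consensus
MINUS_10 = "TATAAT"  # assumed -10 consensus


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoter_candidates(dna, max_mismatch=1, spacers=range(16, 20)):
    """Return (start of -35 box, start of -10 box) pairs on the given strand."""
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 5):
        if mismatches(dna[i:i + 6], MINUS_35) > max_mismatch:
            continue
        for spacer in spacers:
            j = i + 6 + spacer
            box = dna[j:j + 6]
            if len(box) == 6 and mismatches(box, MINUS_10) <= max_mismatch:
                hits.append((i, j))
    return hits


# Toy sequence: a -35 box, a 17 bp spacer, then a -10 box.
toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GGGGGGG"
print(find_promoter_candidates(toy))  # [(2, 25)]
```

Real promoters vary considerably from the consensus, so a scan like this produces candidates to examine rather than confirmed promoters.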

Transcription initiation is more complex in eukaryotes. Eukaryotic RNA polymerase does not directly recognize the 
core promoter sequences. Instead, a collection of proteins called transcription factors mediates the binding of RNA 
polymerase and the initiation of transcription. Only after certain transcription factors are attached to the promoter 
does the RNA polymerase bind to it. The completed assembly of transcription factors and RNA polymerase binds to 
the promoter, forming a transcription initiation complex. Transcription in the archaea domain is similar to 
transcription in eukaryotes. 



Promoter clearance 

After the first bond is synthesized, the RNA polymerase must clear the promoter. During this time there is a 
tendency to release the RNA transcript and produce truncated transcripts. This is called abortive initiation and is 
common to both eukaryotes and prokaryotes. In bacteria, abortive initiation continues to occur until the σ factor 
rearranges, resulting in the transcription elongation complex (which gives a 35 bp moving footprint). The σ factor 
is released before 80 nucleotides of mRNA are synthesized. Once the transcript reaches approximately 23 nucleotides, 
it no longer slips and elongation can occur. This, like most of the remainder of transcription, is an energy-dependent 
process, consuming adenosine triphosphate (ATP). 

In eukaryotes, promoter clearance coincides with phosphorylation by TFIIH of serine 5 on the carboxy-terminal 
domain of RNA polymerase II. 



Elongation 
Simple diagram of transcription elongation

One strand of DNA, the template strand (or noncoding strand), is used as a template for RNA synthesis. As 
transcription proceeds, RNA polymerase traverses the template strand and uses base pairing 
complementarity with the DNA template to create an RNA copy. Although RNA polymerase traverses the template 
strand from 3' → 5', the coding (non-template) strand and newly formed RNA can also be used as reference points, 
so transcription can be described as occurring 5' → 3'. This produces an RNA molecule from 5' → 3', an exact copy 
of the coding strand (except that thymines are replaced with uracils, and the nucleotides are composed of a ribose 
(5-carbon) sugar where DNA has deoxyribose (one fewer oxygen atom) in its sugar-phosphate backbone). 



Unlike DNA replication, mRNA transcription can involve multiple RNA polymerases on a single DNA template and 
multiple rounds of transcription (amplification of particular mRNA), so many mRNA molecules can be rapidly 
produced from a single copy of a gene. 

Elongation also involves a proofreading mechanism that can replace incorrectly incorporated bases. In eukaryotes, 
this may correspond with short pauses during transcription that allow appropriate RNA editing factors to bind. These 
pauses may be intrinsic to the RNA polymerase or due to chromatin structure. 



Termination 



Simple diagram of transcription termination

Bacteria use two different strategies for transcription termination. In Rho-independent transcription 
termination, RNA transcription stops when the newly synthesized RNA molecule forms a G-C rich hairpin 
loop followed by a run of U's, which makes it detach from the DNA template. In the "Rho-dependent" type of 
termination, a protein factor called "Rho" destabilizes the interaction between the template and the mRNA, thus 
releasing the newly synthesized mRNA from the elongation complex. 

Transcription termination in eukaryotes is less well understood but involves cleavage of the new transcript followed by 
template-independent addition of adenines (A's) at its new 3' end, in a process called polyadenylation. 
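The Rho-independent signal described above (a G-C rich hairpin followed by a run of U's) lends itself to a simple pattern search. The sketch below is a toy heuristic, not a published terminator-prediction method: the stem length, loop sizes, G-C threshold and minimum U-run length are all assumed values, and the function names are hypothetical.

```python
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the RNA sequence that would base-pair with `rna` in a hairpin stem."""
    return "".join(PAIR.get(b, "N") for b in reversed(rna))


def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    return sum(b in "GC" for b in seq) / len(seq)


def find_intrinsic_terminators(rna, stem=6, loops=range(3, 7), min_gc=0.6, min_u=4):
    """Return start indices of G-C rich perfect hairpins immediately followed by a U run."""
    rna = rna.upper()
    hits = []
    for i in range(len(rna) - stem + 1):
        left = rna[i:i + stem]
        if gc_fraction(left) < min_gc:
            continue
        for loop in loops:
            r0 = i + stem + loop                      # start of the right stem arm
            right = rna[r0:r0 + stem]
            tail = rna[r0 + stem:r0 + stem + min_u]   # bases just after the hairpin
            if right == reverse_complement(left) and tail == "U" * min_u:
                hits.append(i)
                break
    return hits


# Toy transcript: stem GCCGCC, loop UUCG, stem GGCGGC, then six U's.
toy = "GGG" + "GCCGCC" + "UUCG" + "GGCGGC" + "UUUUUU" + "AA"
print(find_intrinsic_terminators(toy))  # [3]
```

Natural terminator hairpins often contain mismatches and bulges, so requiring a perfect stem, as done here for brevity, would miss many real cases.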



Measuring and detecting transcription 

Transcription can be measured and detected in a variety of ways: 

• Nuclear Run-on assay: measures the relative abundance of newly formed transcripts 

• RNase protection assay and ChIP-chip of RNAP: detect active transcription sites 

• RT-PCR: measures the absolute abundance of total or nuclear RNA levels, which may however differ from 
transcription rates 

• DNA microarrays: measures the relative abundance of the global total or nuclear RNA levels; however, these may 
differ from transcription rates 

• In situ hybridization: detects the presence of a transcript 

• MS2 tagging: by incorporating RNA stem loops, such as MS2, into a gene, these become incorporated into newly 
synthesized RNA. The stem loops can then be detected using a fusion of GFP and the MS2 coat protein, which 
has a high-affinity, sequence-specific interaction with the MS2 stem loops. The recruitment of GFP to the site of 
transcription is visualised as a single fluorescent spot. This approach has revealed that 
transcription occurs in discontinuous bursts, or pulses (see Transcriptional bursting). With the notable exception 
of in situ techniques, most other methods provide cell population averages, and are not capable of detecting this 
fundamental property of genes. 

• Northern blot: the traditional method, and until the advent of RNA-Seq, the most quantitative 

• RNA-Seq: applies next-generation sequencing techniques to sequence whole transcriptomes, which allows the 
measurement of relative abundance of RNA, as well as the detection of additional variations such as fusion genes, 
post-transcriptional edits and novel splice sites 



Transcription factories 

Active transcription units are clustered in the nucleus, in discrete sites called transcription factories or euchromatin. 
Such sites can be visualized by allowing engaged polymerases to extend their transcripts in tagged precursors 
(Br-UTP or Br-U) and immuno-labeling the tagged nascent RNA. Transcription factories can also be localized using 
fluorescence in situ hybridization or marked by antibodies directed against polymerases. There are ~10,000 factories 
in the nucleoplasm of a HeLa cell, among which are ~8,000 polymerase II factories and ~2,000 polymerase III 
factories. Each polymerase II factory contains ~8 polymerases. As most active transcription units are associated with 
only one polymerase, each factory usually contains ~8 different transcription units. These units might be associated 
through promoters and/or enhancers, with loops forming a 'cloud' around the factory. 
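The approximate figures quoted above imply a rough count of engaged polymerase II molecules per HeLa nucleus. The short calculation below only multiplies the numbers given in the text; the variable names are illustrative and the result is an order-of-magnitude estimate.

```python
# Approximate values quoted in the text for a HeLa cell nucleus.
POL2_FACTORIES = 8_000
POL3_FACTORIES = 2_000
POLYMERASES_PER_POL2_FACTORY = 8

total_factories = POL2_FACTORIES + POL3_FACTORIES
active_pol2 = POL2_FACTORIES * POLYMERASES_PER_POL2_FACTORY

print(total_factories)  # 10000
print(active_pol2)      # 64000
```

Because most active units carry a single polymerase, the same product also approximates the number of polymerase II transcription units active at one time.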



History 

A molecule which allows the genetic material to be realized as a protein was first hypothesized by François Jacob 
and Jacques Monod. RNA synthesis by RNA polymerase was established in vitro by several laboratories by 1965; 
however, the RNA synthesized by these enzymes had properties that suggested the existence of an additional factor 
needed to terminate transcription correctly. 

In 1972, Walter Fiers became the first person to actually prove the existence of the terminating enzyme. 

Roger D. Kornberg won the 2006 Nobel Prize in Chemistry "for his studies of the molecular basis of eukaryotic 
transcription". 



Reverse transcription 



Scheme of reverse transcription (diagram labels: target DNA, viral DNA)

Some viruses (such as HIV, the cause of AIDS) have the ability to transcribe RNA into DNA. HIV has an RNA genome 
that is copied into DNA. The resulting DNA can be merged with the DNA genome of the host cell. The main enzyme 
responsible for synthesis of DNA from an RNA template is called reverse transcriptase. In the case of HIV, reverse 
transcriptase is responsible for synthesizing a complementary DNA strand (cDNA) to the viral RNA genome. An 
associated enzyme, ribonuclease H, digests the RNA strand, and reverse transcriptase synthesises a complementary 
strand of DNA to form a double helix DNA structure. This cDNA is integrated into the host cell's genome via another 
enzyme (integrase), causing the host cell to generate viral proteins which reassemble into new viral particles. 
Subsequently, the host cell undergoes programmed cell death, apoptosis. 



Some eukaryotic cells contain an enzyme with reverse transcription activity called telomerase. Telomerase is a 
reverse transcriptase that lengthens the ends of linear chromosomes. Telomerase carries an RNA template from 
which it synthesizes a repeating DNA sequence, or "junk" DNA. This repeated sequence of DNA is important because 
every time a linear chromosome is duplicated it is shortened in length. With "junk" DNA at the ends of 
chromosomes, the shortening eliminates some of the non-essential, repeated sequence rather than the 
protein-encoding DNA sequence farther away from the chromosome end. Telomerase is often activated in cancer 
cells, enabling them to duplicate their genomes indefinitely without losing important protein-coding DNA 
sequence. Activation of telomerase could be part of the process that allows cancer cells to become technically 
immortal. However, the true in vivo significance of telomerase has still not been empirically proven. 
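The buffering role of the repeated end sequence can be illustrated with a toy model. All numbers below (telomere length, loss per division, telomerase gain) are hypothetical values chosen for illustration, and the function name is an assumption; the sketch only counts how many divisions pass before shortening would reach protein-coding sequence.

```python
def divisions_until_coding_loss(telomere_bp, loss_per_division,
                                telomerase_gain=0, max_divisions=1000):
    """Return the division at which shortening first eats past the telomere.

    Returns None if the telomere still covers the chromosome end after
    `max_divisions` divisions (for example, when telomerase restores what is lost).
    """
    for division in range(1, max_divisions + 1):
        telomere_bp += telomerase_gain - loss_per_division
        if telomere_bp < 0:
            return division
    return None


# Hypothetical cell without telomerase: 10,000 bp of repeats, 100 bp lost per division.
print(divisions_until_coding_loss(10_000, 100))                       # 101
# Hypothetical cell in which telomerase restores the loss each division.
print(divisions_until_coding_loss(10_000, 100, telomerase_gain=100))  # None
```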
See also 

• Genetics 

• Molecular biology 

• Translation - process of decoding RNA to form polypeptides. 

• Splicing - process of removing introns from precursor messenger RNA (pre-mRNA) to form messenger 
RNA (mRNA). 

• Reverse transcription - process viruses use to make DNA from RNA 

• Crick's central dogma - DNA is transcribed to RNA which is translated to polypeptides, never the other way 
around. 



Further reading 

• Lehninger Principles of Biochemistry, 5th edition, David L. Nelson & Michael M. Cox 

• Principles of Nuclear Structure and Function, Peter R. Cook 

• Essential Genetics, Peter J. Russell 



External links 

• Interactive Java simulation of transcription initiation. From Center for Models of Life [12] at the Niels Bohr 
Institute. 

• Interactive Java simulation of transcription interference [13]: a game of promoter dominance in bacterial virus. 
From Center for Models of Life [12] at the Niels Bohr Institute. 

• Biology animations about this topic under Chapter 15 and Chapter 18 [14] 
DNA Transfer 



In molecular biology, transformation is the genetic alteration of a cell resulting from the uptake, genomic 
incorporation, and expression of environmental genetic material (DNA).[1] Transformation occurs most commonly in 
bacteria, both naturally and artificially, and refers to DNA taken up from the environment through the cell wall. 
Bacteria that are capable of being transformed are called competent. New genetic material can also be transferred to 
cells through conjugation or transduction. Conjugation involves cell-to-cell contact between two different bacterial 
cells, with the DNA being transferred from one bacterial cell to the other. In transduction, viruses called 
bacteriophages inject the foreign DNA into their host. Introduction of foreign DNA into eukaryotic cells is usually 
called "transfection". Transformation is also used to describe the insertion of new genetic material into 
nonbacterial cells including animal and plant cells. 



History 

Transformation was first demonstrated in 1928 by Frederick Griffith, an English bacteriologist searching for a 
vaccine against bacterial pneumonia. Griffith discovered that a harmless strain of Streptococcus pneumoniae could 
be made virulent after being exposed to heat-killed virulent strains. Griffith hypothesized that some "transforming 
factor" from the heat-killed strain was responsible for making the harmless strain virulent. In 1944 this "transforming 
factor" was identified as being genetic by Oswald Avery, Colin MacLeod, and Maclyn McCarty. They isolated DNA 
from a virulent strain of S. pneumoniae and, using just this DNA, were able to make a harmless strain virulent. They 
called this uptake and incorporation of DNA by bacteria "transformation." See Avery-MacLeod-McCarty 
experiment. 

The results of Avery et al.'s experiments were at first sceptically received by the scientific community, and it was not 
until the development of genetic markers and the discovery of other methods of genetic transfer (conjugation in 1947 
and transduction in 1953) by Joshua Lederberg that Avery's experiments were accepted. Transformation did not 
become a routine procedure in scientific laboratories until 1972, when Stanley Cohen, Annie Chang and Leslie Hsu 
successfully transformed Escherichia coli by treating the bacteria with calcium chloride. This created an efficient 
and convenient procedure for transforming DNA into bacteria and opened the way for molecular cloning in 
biotechnology and research. 

Transformation using electroporation was developed in the late 1980s, increasing the efficiency and number of 
bacterial strains that could be transformed. Transformation of animal and plant cells was also investigated, with 
the first transgenic mouse being created by injecting genetic material that included a rat growth hormone gene into a 
mouse embryo in 1982. In 1907 a bacterium that caused plant tumors, Agrobacterium tumefaciens, was discovered, 
and in the early 1970s the tumour-inducing agent was found to be a DNA plasmid, called the Ti plasmid. By 
removing the genes in the plasmid that caused the tumors and adding in novel genes, researchers were able to infect 
plants with A. tumefaciens and let the bacteria insert their chosen DNA into the plant's genome. Not all plant cells are 
susceptible to infection by A. tumefaciens, so other methods were developed, including electroporation and 
micro-injection. Particle bombardment was made possible with the invention of the Biolistic Particle Delivery 
System (gene gun) by John Sanford in 1990. 
Mechanisms 

Bacteria 

Bacterial transformation may be referred to as a stable genetic change brought about by taking up naked DNA (DNA 
without associated cells or proteins), and competence refers to the state of being able to take up exogenous DNA 
from the environment. Two different forms of competence should be distinguished: natural and artificial. 

Natural competence 

Some bacteria (around 1% of all species) are naturally capable of taking up DNA under laboratory conditions; many 
more may be able to take it up in their natural environments. Such species carry sets of genes specifying 
the machinery for bringing DNA across the cell's membrane or membranes.[10] 

Artificial competence 

Artificial competence is not encoded in the cell's genes. Instead it is induced by laboratory procedures in which cells 
are passively made permeable to DNA, using conditions that do not normally occur in nature.[11] 

Calcium chloride transformation is a method of promoting competence. Chilling cells in the presence of divalent 
cations such as Ca2+ (in CaCl2) prepares the cell membrane to become permeable to plasmid DNA. Cells are 
incubated on ice with the DNA and then briefly heat shocked (e.g. 42 °C for 30-120 seconds), which causes the 
DNA to enter the cell. This method works very well for circular plasmid DNAs. An excellent preparation of 
competent cells will give ~10^8 colonies per microgram of plasmid. A poor preparation will give about 10^4/µg or less. 
Good non-commercial preps should give 10^5 to 10^6 transformants per microgram of plasmid. 
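The efficiencies quoted above are expressed as colonies (transformants) per microgram of plasmid. A minimal sketch of that calculation follows; the function name, parameter names and the example plate counts are hypothetical, and the formula assumes that each colony arises from one transformed cell.

```python
def transformation_efficiency(colonies, dna_ng, fraction_plated=1.0):
    """Return transformants per microgram of plasmid DNA.

    colonies        -- colonies counted on the selective plate
    dna_ng          -- nanograms of plasmid added to the competent cells
    fraction_plated -- fraction of the transformation mix spread on the plate
    """
    dna_ug_plated = dna_ng / 1000.0 * fraction_plated
    if dna_ug_plated <= 0:
        raise ValueError("amount of plated DNA must be positive")
    return colonies / dna_ug_plated


# Hypothetical experiment: 1 ng of plasmid, one tenth of the mix plated, 200 colonies.
efficiency = transformation_efficiency(colonies=200, dna_ng=1.0, fraction_plated=0.1)
print(f"{efficiency:.1e} transformants per microgram")
```

For these example numbers the result is about 2 x 10^6 per microgram, slightly above the range the text gives for a good non-commercial preparation.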

The method usually does not work well for linear molecules such as fragments of chromosomal DNA, probably 
because exonuclease enzymes in the cell rapidly degrade linear DNA. However, cells that are naturally competent 
are usually transformed more efficiently with linear DNA than with plasmids. 

Electroporation is another way to make holes in bacterial (and other) cells, by briefly shocking them with an electric 
field of 10-20 kV/cm. Plasmid DNA can enter the cell through these holes. This method is amenable to use with large 
plasmid DNA.[12] Natural membrane-repair mechanisms will rapidly close these holes after the shock. 

Plasmid transformation 

In order to persist and be stably maintained in the cell, a plasmid DNA molecule must contain an origin of 
replication, which allows it to be replicated in the cell independently of the chromosome. Because transformation 
usually produces a mixture of rare transformed cells and abundant non-transformed cells, a method is needed to 
identify the cells that have acquired the plasmid. Plasmids used in transformation experiments will usually also 
contain a gene giving resistance to an antibiotic that the intended recipient strain of bacteria is sensitive to. Cells able 
to grow on media containing this antibiotic will have been transformed by the plasmid, as cells lacking the plasmid 
will be unable to grow. 

Another marker, used for identifying E. coli cells that have acquired recombinant plasmids, is the lacZ gene, which 
codes for β-galactosidase. Because β-galactosidase is a homo-tetramer, with each monomer made up of one lacZ-α 
and one lacZ-ω protein, if only one of the two requisite proteins is expressed in the resulting cell, no functional 
enzyme will be formed. Thus, if a strain of E. coli without lacZ-α in its genome is transformed using a plasmid 
containing the missing gene fragment, transformed cells will produce β-galactosidase, while untransformed cells will 
not, as they are only able to produce the omega half of the monomer. In this type of transformation, the polylinker 
region of the plasmid lies in the lacZ-α gene fragment, meaning that successfully produced recombinant plasmids 
will have the desired gene inserted somewhere within lacZ-α. When this disrupted gene fragment is expressed by E. 
coli, no usable lacZ-α protein is produced, and therefore no usable β-galactosidase is formed. When grown on media 
containing the colorless, modified galactose sugar X-gal, colonies that are able to metabolize the substrate (and that 
have therefore been transformed, but not by recombinant plasmids) will appear blue in color; colonies that are not 
able to metabolize the substrate (and that have therefore been transformed by recombinant plasmids) will appear 
white. 
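The combined antibiotic and lacZ screen described above reduces to a small decision rule. The sketch below assumes a host strain lacking lacZ-α that is plated on medium containing both the selective antibiotic and X-gal; the function name and return strings are illustrative.

```python
def colony_outcome(has_plasmid, insert_in_lacz_alpha):
    """Predict the appearance of a cell's colony on antibiotic + X-gal medium.

    has_plasmid          -- the cell took up a plasmid carrying the resistance gene
    insert_in_lacz_alpha -- the plasmid is recombinant (insert disrupts lacZ-alpha)
    """
    if not has_plasmid:
        return "no growth"   # untransformed cells are killed by the antibiotic
    if insert_in_lacz_alpha:
        return "white"       # no functional beta-galactosidase, X-gal is not cleaved
    return "blue"            # intact lacZ-alpha complements the host, X-gal is cleaved


for plasmid, insert in [(False, False), (True, False), (True, True)]:
    print(plasmid, insert, "->", colony_outcome(plasmid, insert))
```

White colonies are therefore the ones usually picked for further work, since they are expected to carry the inserted gene.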

Plants 

A number of mechanisms are available to transfer DNA into plant cells: 

• Agrobacterium-mediated transformation is the simplest method of plant transformation. Plant tissue (often 
leaves) is cut into small pieces, e.g. 10 × 10 mm, and soaked for 10 minutes in a fluid containing suspended 
Agrobacterium. Some cells along the cut will be transformed by the bacterium, which inserts its DNA into the cell. 
Placed on selectable rooting and shooting media, the plants will regrow. Some plant species can be transformed 
just by dipping the flowers into a suspension of Agrobacterium and then planting the seeds in a selective medium. 
Unfortunately, many plants are not transformable by this method. 

• Particle bombardment: Particles of gold or tungsten are coated with DNA and then shot into young plant cells or 
plant embryos. Some genetic material will stay in the cells and transform them. This method also allows 
transformation of plant plastids. The transformation efficiency is lower than in Agrobacterium-mediated 
transformation, but most plants can be transformed with this method. 

• Electroporation: makes transient holes in cell membranes using electric shock; this allows DNA to enter as 
described above for bacteria. 

• Viral transformation (transduction): The desired genetic material is packaged into a suitable plant virus and this 
modified virus is allowed to infect the plant. If the genetic material is DNA, it can recombine with the chromosomes to 
produce transformant cells. However, the genomes of most plant viruses consist of single-stranded RNA, which 
replicates in the cytoplasm of the infected cell. For such genomes this method is a form of transfection and not a real 
transformation, since the inserted genes never reach the nucleus of the cell and do not integrate into the host 
genome. The progeny of the infected plants are virus-free and also free of the inserted gene. 

Animals 

Introduction of DNA into animal cells is usually called transfection, and is discussed in the corresponding article. 

External links 

• Bacterial Transformation (a Flash Animation) [13] 
Reverse transcriptase 

Crystallographic structure of HIV reverse transcriptase where the P51 subunit is colored green and the P66 subunit is colored cyan. 

In the fields of molecular biology and biochemistry, a reverse transcriptase, also known as RNA-dependent DNA 
polymerase, is a DNA polymerase enzyme that transcribes single-stranded RNA into double-stranded DNA. It also 
helps in the formation of a double helix DNA once the RNA has been reverse transcribed into a single-strand cDNA. 
Normal transcription involves the synthesis of RNA from DNA; hence, reverse transcription is the reverse of this. 

Reverse transcriptase was discovered by Howard Temin at the University of Wisconsin-Madison, and independently 
by David Baltimore in 1970 at MIT. The two shared the 1975 Nobel Prize in Physiology or Medicine with Renato 
Dulbecco for their discovery. 

Well studied reverse transcriptases include: 

• HIV-1 reverse transcriptase from human immunodeficiency virus type 1 (PDB 1HMV [159]) 

• M-MLV reverse transcriptase from the Moloney murine leukemia virus 

• AMV reverse transcriptase from the avian myeloblastosis virus 

• Telomerase reverse transcriptase that maintains the telomeres of eukaryotic chromosomes 



Function in viruses 

The enzyme is encoded and used by reverse-transcribing viruses, which use the enzyme during the process of 
replication. Reverse-transcribing RNA viruses, such as retroviruses, use the enzyme to reverse-transcribe their RNA 
genomes into DNA, which is then integrated into the host genome and replicated along with it. Reverse-transcribing 
DNA viruses, such as the hepadnaviruses, use RNA as a template for assembling and making DNA 
strands. HIV infects humans with the use of this enzyme. Without reverse transcriptase, the viral genome would not 
be able to incorporate into the host cell's genome, and the virus would fail to replicate. Unlike bacteria, retroviruses 
use preexisting host-encoded transfer RNAs as primers. 
Process of reverse transcription 

Reverse transcriptase creates single-stranded DNA from an RNA template. 

In virus species with reverse transcriptase lacking DNA-dependent DNA polymerase activity, creation of 
double-stranded DNA can possibly be done by host-encoded DNA polymerase δ, mistaking the viral DNA-RNA hybrid for 
a primer and synthesizing a double-stranded DNA by a mechanism similar to primer removal, where the newly 
synthesized DNA displaces the original RNA template. 
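At the level of sequence, the two synthesis steps (RNA to first-strand cDNA, then cDNA to second strand) are two rounds of complementary copying. The sketch below shows only that base-pairing logic, with illustrative function names; it does not model primers, RNase H digestion or strand transfer.

```python
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}


def first_strand_cdna(rna_5to3):
    """Return the cDNA (written 5'->3') complementary to an RNA template."""
    return "".join(RNA_TO_DNA[b] for b in reversed(rna_5to3.upper()))


def second_strand(cdna_5to3):
    """Return the DNA strand (written 5'->3') complementary to the first-strand cDNA."""
    return "".join(DNA_TO_DNA[b] for b in reversed(cdna_5to3.upper()))


rna = "AUGGCGUAA"
cdna = first_strand_cdna(rna)   # TTACGCCAT
sense = second_strand(cdna)     # ATGGCGTAA

# The second strand carries the RNA's sequence with thymine in place of uracil.
print(sense == rna.replace("U", "T"))  # True
```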

The process of reverse transcription is extremely error-prone and it is during this step that mutations may occur. 
Such mutations may cause drug resistance. 



Process in class VI viruses 

Class VI viruses (ssRNA-RT), also called the retroviruses, are RNA reverse-transcribing viruses with a DNA 
intermediate. Their genomes consist of two molecules of positive-sense single-stranded RNA with a 5' cap and 
3' polyadenylated tail. Examples of retroviruses include Human Immunodeficiency Virus (HIV) and 
Human T-Lymphotropic virus (HTLV). Creation of double-stranded DNA occurs in the cytosol [160] as a series of 
steps: 

1. A specific cellular tRNA acts as a primer and hybridizes to a complementary part of the virus genome called the 
primer binding site or PBS 

2. Complementary DNA then binds to the U5 (non-coding region) and R region (a direct repeat found at both ends of 
the RNA molecule) of the viral RNA 

3. A domain on the reverse transcriptase enzyme called RNAse H degrades the 5' end of the RNA which removes the 
U5 and R region 

4. The primer then 'jumps' to the 3' end of the viral genome and the newly synthesised DNA strand hybridizes to the 
complementary R region on the RNA 

5. The first strand of complementary DNA (cDNA) is extended and the majority of viral RNA is degraded by RNAse H 

6. Once the strand is completed, second strand synthesis is initiated from the viral RNA 

7. There is then another 'jump' where the PBS from the second strand hybridizes with the complementary PBS on 
the first strand 

8. Both strands are extended further and can be incorporated into the host's genome by the enzyme integrase 

Creation of double-stranded DNA also involves strand transfer, in which there is a translocation of short DNA 
product from initial RNA-dependent DNA synthesis to acceptor template regions at the other end of the genome, 
which are later reached and processed by the reverse transcriptase for its DNA-dependent DNA activity.[161] 

Retroviral RNA is arranged from 5' terminus to 3' terminus. The site where the primer is annealed to viral RNA is 
called the primer-binding site (PBS). The RNA on the 5' side of the PBS is called U5, and the RNA on the 3' side of the 
PBS is called the leader. The tRNA primer is unwound between 14 and 22 nucleotides and forms a base-paired duplex with 
the viral RNA at the PBS. The location of the PBS near the 5' terminus of the viral RNA is unusual because reverse 
transcriptase synthesizes DNA from the 3' end of the primer in the 5' to 3' direction. Therefore, the primer and reverse 
transcriptase must be relocated to the 3' end of the viral RNA. To accomplish this repositioning, multiple steps and 
various enzymes, including DNA polymerase, ribonuclease H (RNase H) and polynucleotide unwinding activity, are needed. 

The HIV reverse transcriptase also has ribonuclease activity that degrades the viral RNA during the synthesis of 
cDNA, as well as DNA-dependent DNA polymerase activity that copies the sense cDNA strand into an antisense 
DNA to form a double-stranded viral DNA intermediate (vDNA).[163] 



In eukaryotes 

Self-replicating stretches of eukaryotic genomes known as retrotransposons utilize reverse transcriptase to move 
from one position in the genome to another via an RNA intermediate. They are found abundantly in the genomes of 
plants and animals. Telomerase is another reverse transcriptase found in many eukaryotes, including humans, which 
carries its own RNA template; this RNA is used as a template for DNA replication.[164][165] 



In prokaryotes 

Reverse transcriptases are also found in bacterial retron msr RNAs, distinct sequences which code for reverse 
transcriptase and are used in the synthesis of msDNA. In order to initiate synthesis of DNA, a primer is needed. In 
bacteria, the primer is synthesized during replication.[166] 



Structure 

Reverse transcriptase enzymes include an RNA-dependent DNA polymerase and a DNA-dependent DNA 
polymerase, which work together to perform reverse transcription. In addition to this function, retroviral reverse 
transcriptases have a domain belonging to the RNase H family which is vital to their replication. 



Replication fidelity 

There are three different replication systems during the life cycle of a retrovirus. First, the reverse transcriptase
synthesizes viral DNA from viral RNA, and then from the newly made complementary DNA strand. The second
replication process occurs when host cellular DNA polymerase replicates the integrated viral DNA. Lastly, RNA
polymerase II transcribes the proviral DNA into RNA, which will be packed into virions. Therefore, mutation can
occur during one or all of these replication steps.[167]

Reverse transcriptase has a high error rate when transcribing RNA into DNA since, unlike many other DNA
polymerases, it has no proofreading ability. This high error rate allows mutations to accumulate at an accelerated rate
relative to proofread forms of replication. The commercially available reverse transcriptases produced by Promega
are quoted by their manuals as having error rates in the range of 1 in 17,000 bases for AMV and 1 in 30,000 bases
for M-MLV.[168]
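
To put the quoted error rates in perspective, the short Python sketch below estimates how many misincorporations to expect in a cDNA of a given length. It assumes that errors occur independently and uniformly along the template, which is a simplification; the function names and the 3,000-nucleotide example length are illustrative and do not come from the cited manual.

```python
def expected_errors(length_nt: int, bases_per_error: float) -> float:
    """Expected number of misincorporations in a cDNA of length_nt bases."""
    return length_nt / bases_per_error


def prob_error_free(length_nt: int, bases_per_error: float) -> float:
    """Probability that a cDNA of length_nt bases contains no error,
    assuming each base is copied wrongly with probability 1/bases_per_error."""
    per_base_error = 1.0 / bases_per_error
    return (1.0 - per_base_error) ** length_nt


# Error rates quoted in the text: 1 in 17,000 (AMV) and 1 in 30,000 (M-MLV).
RATES = {"AMV": 17_000, "M-MLV": 30_000}

if __name__ == "__main__":
    cdna_length = 3_000  # hypothetical transcript length, in nucleotides
    for enzyme, rate in RATES.items():
        print(enzyme,
              round(expected_errors(cdna_length, rate), 3),
              round(prob_error_free(cdna_length, rate), 3))
```

Under these assumptions, most copies of a transcript of a few thousand nucleotides would be error-free with either enzyme, but a measurable fraction would carry at least one substitution.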

Applications



Antiviral drugs 

As HIV uses reverse transcriptase to copy its genetic material and 
generate new viruses (part of a retrovirus proliferation circle), 
specific drugs have been designed to disrupt the process and 
thereby suppress its growth. Collectively, these drugs are known 
as reverse transcriptase inhibitors and include the nucleoside and 
nucleotide analogues zidovudine (trade name Retrovir), 
lamivudine (Epivir) and tenofovir (Viread), as well as 
non-nucleoside inhibitors, such as nevirapine (Viramune). 







Molecular biology 

Reverse transcriptase is commonly used in research to apply the 
polymerase chain reaction technique to RNA in a technique called 
reverse transcription polymerase chain reaction (RT-PCR). The 
classical PCR technique can be applied only to DNA strands, but, 
with the help of reverse transcriptase, RNA can be transcribed into
DNA, thus making PCR analysis of RNA molecules possible. Reverse transcriptase is used also to create cDNA
libraries from mRNA. The commercial availability of reverse transcriptase greatly improved knowledge in the area 
of molecular biology, as, along with other enzymes, it allowed scientists to clone, sequence, and characterise DNA. 



The molecular structure of zidovudine (AZT), a drug 
used to inhibit HIV reverse transcriptase 



History 

The idea of reverse transcription was very unpopular at first, as it contradicted the central dogma of molecular
biology, which states that DNA is transcribed into RNA, which is then translated into proteins. However, in 1970,
when Howard Temin and David Baltimore independently discovered the enzyme responsible for reverse
transcription, named reverse transcriptase, the possibility that genetic information could be passed on in this
manner was finally accepted.

See also 

• cDNA library 

• DNA polymerase 

• msDNA 

• Reverse transcribing virus 

• RNA polymerase 

• Telomerase 

• Retrotransposon marker 

External links

• MeSH RNA+Transcriptase [169] 

• Animation of reverse transcriptase action and three reverse transcriptase inhibitors [170]

• Molecule of the month [171] (September 2002) at the Protein Data Bank 

• HIV Replication 3D Medical Animation. (Nov 2008). Video by Boehringer Ingelheim. 



References 

[1] PDB 1HMV (http://www.rcsb.org/pdb/explore/explore.do?structureId=1HMV); Rodgers DW, Gamblin SJ, Harris BA, Ray S, Culp JS,
Hellmig B, Woolf DJ, Debouck C, Harrison SC (February 1995). "The structure of unliganded reverse transcriptase from the human
immunodeficiency virus type 1" (http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=42671). Proc. Natl. Acad.
Sci. U.S.A. 92 (4): 1222-6. PMID 7532306. PMC 42671.
[2] http://pfam.sanger.ac.uk/family?acc=PF00078
[3] http://www.ebi.ac.uk/interpro/DisplayIproEntry?ac=IPR000477
[4] http://www.expasy.org/cgi-bin/prosite-search-ac?PS50878
[5] http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?tlev=fa;&pdb=1hmv
[6] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1bqm
[7] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1bqn
[8] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1c0t
[9] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1c0u



[10] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1c1b
[11] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1c1c
[12] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1d0e
[13] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1d1u
[14] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1dlo
[15] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1dtq
[16] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1dtt
[17] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1eet
[18] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1ep4
[19] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1fk9
[20] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1fko
[21] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1fkp
[22] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1har
[23] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1hmv
[24] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1hni
[25] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1hnv
[26] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1hpz
[27] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1hqe
[28] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1hqu
[29] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1hvu
[30] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1hys
[31] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1i6j
[32] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1ikv
[33] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1ikw
[34] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1ikx
[35] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1iky
[36] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1j5o
[37] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1jkh
[38] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1jla
[39] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1jlb
[40] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1jlc
[41] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1jle
[42] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1jlf
[43] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1jlg
[44] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1jlq
[45] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1klm
[46] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1lw0






[47] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1lw2
[48] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1lwc
[49] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1lwe
[50] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1lwf
[51] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1mml
[52] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1mu2
[53] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1n4l
[54] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1n5y
[55] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1n6q
[56] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1nnd
[57] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1qai
[58] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1qaj
[59] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1qe1
[60] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1r0a
[61] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rev
[62] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rt1
[63] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rt2
[64] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rt3
[65] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rt4
[66] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rt5
[67] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rt6
[68] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rt7
[69] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rtd
[70] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rth
[71] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rti
[72] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rtj
[73] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1rw3
[74] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1s1t
[75] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1s1u
[76] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1s1v
[77] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1s1w
[78] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1s1x
[79] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1s6p
[80] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1s6q
[81] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1s9e
[82] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1s9g
[83] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1suq
[84] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1sv5
[85] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1t03
[86] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1t05
[87] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1tkt
[88] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1tkx
[89] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1tkz
[90] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1tl1
[91] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1tl3
[92] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1tv6
[93] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1tvr
[94] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1uwb
[95] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1vrt
[96] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1vru
[97] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1ztt
[98] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1ztw

[99] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2b5j
[100] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2b6a
[101] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2ban
[102] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2be2
[103] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2fjv
[104] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2fjw
[105] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2fjx






[106] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2fvp
[107] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2fvq
[108] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2fvr
[109] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2fvs
[110] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2hmi
[111] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2hnd
[112] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2hny
[113] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2hnz
[114] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2i5j
[115] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2iaj
[116] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2ic3
[117] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2opp
[118] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2opq
[119] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2opr
[120] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2ops
[121] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2r2r
[122] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2r2s
[123] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2r2t
[124] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2r2u
[125] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2rf2
[126] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2rki
[127] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2vg5
[128] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2vg6
[129] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2vg7
[130] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2zd1
[131] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2ze2
[132] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3bgr
[133] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3c6t
[134] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3c6u
[135] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3di6
[136] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3dle
[137] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3dlg
[138] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3dlk
[139] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3dm2
[140] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3dok
[141] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3dol
[142] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3drp
[143] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3drr
[144] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3drs
[145] http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=3hvt
[146] http://www.chem.qmul.ac.uk/iubmb/enzyme/EC2/7/7/49.html
[147] http://toolserver.org/~magnus/cas.php?language=en&cas=9068-38-6&title=
[148] http://www.ebi.ac.uk/intenz/query?cmd=SearchEC&ec=2.7.7.49
[149] http://www.brenda-enzymes.org/php/result_flat.php4?ecno=2.7.7.49
[150] http://www.expasy.org/enzyme/2.7.7.49
[151] http://www.genome.ad.jp/dbget-bin/www_bget?enzyme+2.7.7.49
[152] http://biocyc.org/META/substring-search?type=NIL&object=2.7.7.49
[153] http://bioinfo.genopole-toulouse.prd.fr/priam/cgi-bin/PRIAM_profiles_CurrentRelease.pl?EC=2.7.7.49
[154] http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/enzymes/GetPage.pl?ec_number=2.7.7.49
[155] http://amigo.geneontology.org/cgi-bin/amigo/go.cgi?query=GO:0003964&view=details
[156] http://www.ebi.ac.uk/ego/DisplayGoTerm?id=GO:0003964&format=normal
[157] http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&term=2.7.7.49%5BEC/RN%20Number%5D%20AND%20pubmed%20pmc%20local%5Bsb%5D
[158] http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&term=2.7.7.49%5BEC/RN%20Number%5D
[159] http://www.rcsb.org/pdb/explore/explore.do?structureId=1HMV

[160] Bio-Medicine.org - Retrovirus (http://www.bio-medicine.org/biology-definition/Retrovirus/) Retrieved on 17 Feb, 2009 

[161] Telesnitsky, A., Goff, S.P. (1993). "Strong-stop strand transfer during reverse transcription", in Skalka, M. A., Goff, S.P. Reverse 
transcriptase (1st ed.). New York: Cold Spring Harbor, p. 49. ISBN 0-87969-382-7. 






[162] Bernstein, A.; Weiss, Robin; Tooze, John (1985). "RNA tumor viruses". Molecular Biology of Tumor Viruses (2nd ed.). Cold Spring
Harbor, N.Y: Cold Spring Harbor Laboratory.
[163] Doc Kaiser's Microbiology Home Page > IV. VIRUSES > F. ANIMAL VIRUS LIFE CYCLES > 3. The Life Cycle of HIV
(http://student.ccbcmd.edu/courses/bio141/lecguide/unit3/viruses/hivlc.html) Community College of Baltimore County. Updated: Jan., 2008
[164] Monty Krieger; Matthew P Scott; Matsudaira, Paul T.; Lodish, Harvey F.; Darnell, James E.; Lawrence Zipursky; Kaiser, Chris; Arnold
Berk (2004). Molecular cell biology (5th ed.). New York: W.H. Freeman and CO. ISBN 0-7167-4366-3.
[165] Witzany G (August 2008). "The Viral Origins of Telomeres and Telomerases and their Important Role in Eukaryogenesis and Genome
Maintenance". Biosemiotics 1 (2): 191-206. doi:10.1007/s12304-008-9018-0.
[166] Hurwitz J, Leis JP (January 1972). "RNA-dependent DNA polymerase activity of RNA tumor viruses. I. Directing influence of DNA in the
reaction" (http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=356270). J. Virol. 9 (1): 116-29. PMID 4333538.
PMC 356270.
[167] Bebenek, K., Kunkel, T. A. (1993). "The fidelity of retroviral reverse transcriptases", in Skalka, M. A., Goff, S. P. Reverse transcriptase.
New York: Cold Spring Harbor Laboratory Press, p. 85. ISBN 0-87969-382-7.
[168] Promega kit instruction manual (1999) (http://www.promega.com/pnotes/71/7807_22/7807_22_core.pdf)
[169] http://www.nlm.nih.gov/cgi/mesh/2009/MB_cgi?mode=&term=RNA+Transcriptase
[170] http://www.tibotec.com/bgdisplay.jhtml?itemname=HIV_discovery&product=none&s=2
[171] http://www.rcsb.org/pdb/static.do?p=education_discussion/molecule_of_the_month/pdb33_1.html
[172] http://www.youtube.com/watch?v=R08MP3wMvqg

DNA microarray 



For terminology, see glossary below.

A DNA microarray is a multiplex technology used in molecular biology and in medicine. It consists of an arrayed
series of thousands of microscopic spots of DNA oligonucleotides, called features, each containing picomoles
(10^-12 moles) of a specific DNA sequence, known as a probe (or reporter). A probe can be a short section of a gene or
other DNA element that is used to hybridize a cDNA or cRNA sample (called the target) under high-stringency
conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or
chemiluminescence-labeled targets to determine relative abundance of nucleic acid sequences in the target. Since an
array can contain tens of thousands of probes, a microarray experiment can accomplish many genetic tests in
parallel. Therefore arrays have dramatically accelerated many types of investigation.

Example of an approximately 40,000 probe spotted oligo microarray with enlarged inset to show detail.

In standard microarrays, the probes are attached via surface engineering to a solid surface by a covalent bond to a
chemical matrix (via epoxy-silane, amino-silane, lysine, polyacrylamide or others). The solid surface can be glass or
a silicon chip, in which case the array is commonly known as a gene chip, or colloquially as an Affy chip when an
Affymetrix chip is used. Other microarray platforms, such as Illumina, use microscopic beads instead of a large solid
support. DNA arrays differ from other types of microarray only in that they either measure DNA or use DNA as part of
their detection system.

DNA microarrays can be used to measure changes in expression levels, to detect single nucleotide polymorphisms
(SNPs), and to genotype or resequence mutant genomes (see Uses and types section). Microarrays also differ in
fabrication, workings, accuracy, efficiency, and cost (see Fabrication section). Additional factors for microarray
experiments are the experimental design and the methods of analyzing the data (see Bioinformatics section).



History 

Microarray technology evolved from Southern blotting, where fragmented DNA is attached to a substrate and then 
probed with a known gene or fragment. The use of a collection of distinct DNAs in arrays for expression profiling 
was first described in 1987, and the arrayed DNAs were used to identify genes whose expression is modulated by 
interferon. These early gene arrays were made by spotting cDNAs onto filter paper with a pin-spotting device. The
use of miniaturized microarrays for gene expression profiling was first reported in 1995, and a complete
eukaryotic genome (Saccharomyces cerevisiae) on a microarray was published in 1997.



Principle 



Figure labels: fixed probes; different features (e.g. bind different genes); labelled target (sample).

The core principle behind microarrays is hybridization between two DNA strands, the property of complementary
nucleic acid sequences to specifically pair with each other by forming hydrogen bonds between complementary
nucleotide base pairs. A high number of complementary base pairs in a nucleotide sequence means tighter
non-covalent bonding between the two strands. After non-specifically bound sequences are washed off, only
strongly paired strands remain hybridized. Fluorescently labelled target sequences that bind to a probe sequence
therefore generate a signal that depends on the strength of the hybridization, which is determined by the number of
paired bases, the hybridization conditions (such as temperature), and washing after hybridization. The total strength
of the signal from a spot (feature) depends upon the amount of target sample binding to the probes present on that
spot. Microarrays use relative quantitation, in which the intensity of a feature is compared to the intensity of the
same feature under a different condition, and the identity of the feature is known by its position. An alternative to
microarrays is serial analysis of gene expression, where the transcriptome is sequenced, allowing an absolute
measurement.
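
The dependence of hybridization strength on the number of paired bases can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: probe and target have equal length, they are aligned end to end in antiparallel orientation, and every Watson-Crick pair counts equally, so temperature, washing and sequence composition are ignored. The function names are illustrative.

```python
# Watson-Crick pairing partners for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the strand that pairs perfectly with seq (read 5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def paired_bases(probe: str, target: str) -> int:
    """Count positions at which target pairs with probe.

    Both sequences are written 5' to 3' and must have the same length.
    """
    if len(probe) != len(target):
        raise ValueError("probe and target must have the same length")
    perfect_partner = reverse_complement(probe)
    return sum(1 for a, b in zip(perfect_partner, target.upper()) if a == b)


if __name__ == "__main__":
    probe = "AAGCTTGC"
    print(paired_bases(probe, "GCAAGCTT"))  # fully complementary target
    print(paired_bases(probe, "GCAAGCTA"))  # target with one mismatch
```

In this model a fully complementary target pairs at every position, while each mismatch lowers the count by one, mirroring the weaker binding of partially complementary strands.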



Hybridization of the target to the probe: fully complementary strands bind strongly, whereas partially
complementary strands bind weakly.




The steps required in a microarray experiment: sample, purification, reverse transcription (RT), coupling,
hybridization and washes, scanning, normalization and analysis.

Uses and types

Many types of array exist and the broadest distinction is whether they are 
spatially arranged on a surface or on coded beads: 

• The traditional solid-phase array is a collection of orderly microscopic
"spots", called features, each with a specific probe attached to a solid
surface, such as a glass, plastic or silicon biochip (commonly known as a gene
chip, genome chip, DNA chip or gene array). Thousands of them can be
placed in known locations on a single DNA microarray.

• The alternative bead array is a collection of microscopic polystyrene 
beads, each with a specific probe and a ratio of two or more dyes, which 
do not interfere with the fluorescent dyes used on the target sequence. 

DNA microarrays can be used to detect DNA (as in comparative genomic hybridization), or detect RNA (most 
commonly as cDNA after reverse transcription) that may or may not be translated into proteins. The process of 
measuring gene expression via cDNA is called expression analysis or expression profiling. 
Applications include: 



Application or technology, with synopsis:

• Gene expression profiling: In an mRNA or gene expression profiling experiment the expression levels of thousands of
genes are simultaneously monitored to study the effects of certain treatments, diseases, and developmental stages on
gene expression. For example, microarray-based gene expression profiling can be used to identify genes whose
expression is changed in response to pathogens or other organisms by comparing gene expression in infected to that in
uninfected cells or tissues.[4]

• Comparative genomic hybridization: Assessing genome content in different cells or closely related organisms.

• GeneID: Small microarrays to check IDs of organisms in food and feed (like GMO [7]), mycoplasms in cell culture, or
pathogens for disease detection, mostly combining PCR and microarray technology.

• Chromatin immunoprecipitation on Chip: DNA sequences bound to a particular protein can be isolated by
immunoprecipitating that protein (ChIP); these fragments can then be hybridized to a microarray (such as a tiling
array), allowing the determination of protein binding site occupancy throughout the genome. Example targets for
immunoprecipitation are histone modifications (H3K27me3, H3K4me2, H3K9me3, etc.), Polycomb-group proteins
(PRC2: Suz12, PRC1: YY1) and trithorax-group proteins (Ash1) to study the epigenetic landscape, or RNA Polymerase II
to study the transcription landscape.

• DamID: Analogously to ChIP, genomic regions bound by a protein of interest can be isolated and used to probe a
microarray to determine binding site occupancy. Unlike ChIP, DamID does not require antibodies but makes use of
adenine methylation near the protein's binding sites to selectively amplify those regions; the methylation is
introduced by expressing minute amounts of the protein of interest fused to bacterial DNA adenine methyltransferase.

• SNP detection: Identifying single nucleotide polymorphisms among alleles within or between populations. Several
applications of microarrays make use of SNP detection, including genotyping, forensic analysis, measuring
predisposition to disease, identifying drug candidates, evaluating germline mutations in individuals or somatic
mutations in cancers, assessing loss of heterozygosity, or genetic linkage analysis.

• Alternative splicing detection: An exon junction array design uses probes specific to the expected or potential
splice sites of predicted exons for a gene. It is of intermediate density, or coverage, between a typical gene
expression array (with 1-3 probes per gene) and a genomic tiling array (with hundreds or thousands of probes per
gene). It is used to assay the expression of alternative splice forms of a gene. Exon arrays have a different design,
employing probes designed to detect each individual exon for known or predicted genes, and can be used for detecting
different splicing isoforms.

• Fusion genes microarray: A fusion gene microarray can detect fusion transcripts, e.g. from cancer specimens. The
principle behind this builds on the alternative splicing microarrays. The oligo design strategy enables combined
measurements of chimeric transcript junctions with exon-wise measurements of individual fusion partners.

• Tiling array: Genome tiling arrays consist of overlapping probes designed to densely represent a genomic region of
interest, sometimes as large as an entire human chromosome. The purpose is to empirically detect expression of
transcripts or alternatively spliced forms which may not have been previously known or predicted.

Two Affymetrix chips
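
As an illustration of the tiling idea, the following Python sketch cuts a region into overlapping probes. The function name, the probe length and the step size are hypothetical choices for the example; real tiling designs also filter probes for repeats, melting temperature and cross-hybridization, which is not modelled here.

```python
def tiling_probes(region: str, probe_len: int, step: int) -> list[tuple[int, str]]:
    """Return (start, sequence) pairs covering region with fixed-length probes.

    Consecutive probes overlap whenever step is smaller than probe_len.
    """
    if probe_len <= 0 or step <= 0:
        raise ValueError("probe_len and step must be positive")
    last_start = len(region) - probe_len
    return [(start, region[start:start + probe_len])
            for start in range(0, last_start + 1, step)]


if __name__ == "__main__":
    # A 10-base toy region tiled with 4-mers every 2 bases.
    for start, probe in tiling_probes("ACGTACGTAC", probe_len=4, step=2):
        print(start, probe)
```

A smaller step gives denser coverage at the cost of more probes for the same region.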



Fabrication 

Microarrays can be manufactured in different ways, depending on the number of probes under examination, costs, 
customization requirements, and the type of scientific question being asked. Arrays may have as few as 10 probes or 
up to 2.1 million (NimbleGen, Roche) micrometre-scale probes from commercial vendors.

Spotted vs. in situ synthesised arrays 

Microarrays can be fabricated using a variety of technologies, including printing with fine-pointed pins onto glass
slides, photolithography using pre-made masks, photolithography using dynamic micromirror devices, ink-jet
printing, or electrochemistry on microelectrode arrays.

In spotted microarrays, the probes are oligonucleotides, cDNA or small fragments of PCR products that correspond 
to mRNAs. The probes are synthesized prior to deposition on the array surface and are then "spotted" onto glass. A 
common approach uses an array of fine pins or needles, controlled by a robotic arm, that is dipped into wells
containing DNA probes and then deposits each probe at designated locations on the array surface. The resulting
"grid" of probes represents the nucleic acid profiles of the prepared probes and is ready to receive complementary 
cDNA or cRNA "targets" derived from experimental or clinical samples. This technique is used by research 
scientists around the world to produce "in-house" printed microarrays from their own labs. These arrays may be 
easily customized for each experiment, because researchers can choose the probes and printing locations on the 
arrays, synthesize the probes in their own lab (or collaborating facility), and spot the arrays. They can then generate 
their own labeled samples for hybridization, hybridize the samples to the array, and finally scan the arrays with their 
own equipment. This provides a relatively low-cost microarray that may be customized for each study, and avoids 
the costs of purchasing often more expensive commercial arrays that may represent vast numbers of genes that are 
not of interest to the investigator. Publications exist which indicate in-house spotted microarrays may not provide the
same level of sensitivity as commercial oligonucleotide arrays, possibly owing to the small batch sizes
and reduced printing efficiencies when compared to the industrial manufacture of oligo arrays.

In oligonucleotide microarrays, the probes are short sequences designed to match parts of the sequence of known or 
predicted open reading frames. Although oligonucleotide probes are often used in "spotted" microarrays, the term 
"oligonucleotide array" most often refers to a specific technique of manufacturing. Oligonucleotide arrays are 
produced by printing short oligonucleotide sequences designed to represent a single gene or family of gene
splice-variants by synthesizing this sequence directly onto the array surface instead of depositing intact sequences.
Sequences may be longer (60-mer probes such as the Agilent design) or shorter (25-mer probes produced by
Affymetrix) depending on the desired purpose; longer probes are more specific to individual target genes, whereas
shorter probes may be spotted in higher density across the array and are cheaper to manufacture. One technique used
to produce oligonucleotide arrays is photolithographic synthesis (Agilent and Affymetrix) on a silica substrate,
where light and light-sensitive masking agents are used to "build" a sequence one nucleotide at a time across the
entire array. Each applicable probe is selectively "unmasked" prior to bathing the array in a solution of a single
nucleotide, then a masking reaction takes place and the next set of probes is unmasked in preparation for a different
nucleotide exposure. After many repetitions, the sequence of every probe becomes fully constructed. More recently,
Maskless Array Synthesis from NimbleGen Systems has combined flexibility with large numbers of probes.
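
The unmask, expose and re-mask cycle described above can be mimicked with a short Python simulation. This is a deliberately simplified sketch and not a vendor's actual synthesis schedule: it assumes that all probes are extended layer by layer and that, within each layer, the array is exposed to A, C, G and T in a fixed order. The function name and the example probes are illustrative.

```python
def synthesize(probes: list[str]) -> tuple[list[str], int]:
    """Grow every probe one nucleotide at a time, as in masked in situ synthesis.

    Returns the grown sequences and the number of nucleotide exposures used.
    """
    grown = [""] * len(probes)
    exposures = 0
    longest = max((len(p) for p in probes), default=0)
    for position in range(longest):
        for base in "ACGT":
            # "Unmask" only the probes whose next required base is this one.
            unmasked = [i for i, p in enumerate(probes)
                        if len(grown[i]) == position
                        and position < len(p)
                        and p[position] == base]
            if not unmasked:
                continue  # no probe needs this base in this layer
            for i in unmasked:
                grown[i] += base  # coupling step for the unmasked probes
            exposures += 1
    return grown, exposures


if __name__ == "__main__":
    sequences, steps = synthesize(["ACG", "AGG", "TC"])
    print(sequences, steps)
```

Because all probes that need the same base at the same position are extended in one exposure, the number of exposures grows with probe length rather than with the number of probes, which is what makes the approach practical for very dense arrays.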

Two-channel vs. one-channel detection

Two-color microarrays or two-channel microarrays are typically hybridized with cDNA prepared from two samples to be
compared (e.g. diseased tissue versus healthy tissue), each labeled with a different fluorophore. Fluorescent dyes
commonly used for cDNA labeling include Cy3, which has a fluorescence emission wavelength of 570 nm (corresponding
to the green part of the light spectrum), and Cy5 with a fluorescence emission wavelength of 670 nm (corresponding
to the red part of the light spectrum). The two Cy-labeled cDNA samples are mixed and hybridized to a single
microarray that is then scanned in a microarray scanner to visualize fluorescence of the two fluorophores after
excitation with a laser beam of a defined wavelength. Relative intensities of each fluorophore may then be used in
ratio-based analysis to identify up-regulated and down-regulated genes.
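The ratio-based analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not a vendor algorithm: it assumes the test sample is labeled with Cy5 and the reference with Cy3, that intensities are already background-corrected, and that a two-fold change (a log2 ratio of 1.0) is the cut-off. The function names and the threshold are hypothetical choices made for this example.

```python
import math


def log2_ratio(cy5_intensity, cy3_intensity):
    """Return log2(Cy5 / Cy3) for one spot; both intensities must be positive."""
    if cy5_intensity <= 0 or cy3_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5_intensity / cy3_intensity)


def classify_spot(cy5_intensity, cy3_intensity, threshold=1.0):
    """Label a spot 'up', 'down' or 'unchanged' (threshold is an assumed cut-off)."""
    ratio = log2_ratio(cy5_intensity, cy3_intensity)
    if ratio >= threshold:
        return "up"
    if ratio <= -threshold:
        return "down"
    return "unchanged"
```

Working on the log2 scale makes a two-fold increase and a two-fold decrease symmetric around zero, which is why ratios are usually log-transformed before thresholds are applied.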

Diagram of typical dual-colour microarray experiment.

Oligonucleotide microarrays often carry control probes designed to hybridize with RNA spike-ins. The degree of
hybridization between the spike-ins and the control probes is used to normalize the hybridization measurements for
the target probes. Although absolute levels of gene expression may be determined in the two-color array in rare
instances, the relative differences in expression among different spots within a sample and between samples are
the preferred basis for data analysis in the two-color system. Examples of providers for such microarrays include
Agilent with their Dual-Mode platform, Eppendorf with their DualChip platform for colorimetric Silverquant
labeling, and TeleChem International with Arrayit.
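One simple way to picture spike-in normalization is as a single scale factor per array. The sketch below assumes that all spike-ins were added at the same known amount, so their control probes should give one expected signal. Real pipelines typically fit more elaborate models. The function names and the idea of a single `expected_spike_signal` are assumptions made for illustration.

```python
from statistics import mean


def spike_in_scale_factor(observed_spike_signals, expected_spike_signal):
    """Factor that maps the mean observed spike-in signal onto the expected value."""
    observed_mean = mean(observed_spike_signals)
    if observed_mean <= 0:
        raise ValueError("mean spike-in signal must be positive")
    return expected_spike_signal / observed_mean


def normalize_targets(target_signals, scale_factor):
    """Apply the same scale factor to every target probe on the array."""
    return {probe: signal * scale_factor for probe, signal in target_signals.items()}
```

For example, if the control probes read half of what the spike-in amount should produce, every target measurement on that array is doubled before arrays are compared.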

In single-channel microarrays or one-color microarrays, the arrays provide intensity data for each probe or probe set 
indicating a relative level of hybridization with the labeled target. However, they do not truly indicate abundance 
levels of a gene but rather relative abundance when compared to other samples or conditions when processed in the 
same experiment. Each RNA molecule encounters protocol- and batch-specific bias during the amplification, labeling,
and hybridization phases of the experiment, making comparisons between genes for the same microarray
uninformative. The comparison of two conditions for the same gene requires two separate single-dye hybridizations. 
Several popular single-channel systems are the Affymetrix "Gene Chip", Illumina "Bead Chip", Agilent 
single-channel arrays, the Applied Microarrays "CodeLink" arrays, and the Eppendorf "DualChip & Silverquant". 
One strength of the single-dye system lies in the fact that an aberrant sample cannot affect the raw data derived from 
other samples, because each array chip is exposed to only one sample (as opposed to a two-color system in which a 
single low-quality sample may drastically impinge on overall data precision even if the other sample was of high 
quality). Another benefit is that data are more easily compared to arrays from different experiments so long as batch 
effects have been accounted for. A drawback to the one-color system is that, when compared to the two-color 
system, twice as many microarrays are needed to compare samples within an experiment. 
Microarrays and bioinformatics




The advent of inexpensive microarray experiments created several specific 
bioinformatics challenges: 

• the multiple levels of replication in experimental design (Experimental 
design) 

• the number of platforms and independent groups and data format 
(Standardization) 

• the treatment of the data (Statistical analysis) 

• accuracy and precision (Relation between probe and gene) 

• the sheer volume of data and the ability to share it (Data warehousing) 

Experimental design 

Due to the biological complexity of gene expression, the considerations of experimental design that are discussed
in the expression profiling article are of critical importance if statistically and biologically valid conclusions
are to be drawn from the data.

There are three main elements to consider when designing a microarray experiment. First, replication of the 
biological samples is essential for drawing conclusions from the experiment. Second, technical replicates (two RNA 
samples obtained from each experimental unit) help to ensure precision and allow for testing differences within 
treatment groups. The technical replicates may be two independent RNA extractions or two aliquots of the same 
extraction. Third, spots of each cDNA clone or oligonucleotide are present as replicates (at least duplicates) on the 
microarray slide, to provide a measure of technical precision in each hybridization. It is critical that information 
about the sample preparation and handling is discussed, in order to help identify the independent units in the 
experiment and to avoid inflated estimates of statistical significance. 
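The technical precision provided by replicate spots can be summarized, for instance, by the mean and the coefficient of variation (CV) of the replicate intensities. The sketch below is only one possible summary. The function name is hypothetical and the choice of the CV as the precision measure is an assumption, not something prescribed by the text above.

```python
from statistics import mean, stdev


def replicate_summary(replicate_intensities):
    """Return (mean, coefficient of variation) for replicate spots of one probe."""
    if len(replicate_intensities) < 2:
        raise ValueError("at least two replicate spots are required")
    average = mean(replicate_intensities)
    if average <= 0:
        raise ValueError("mean intensity must be positive")
    # Sample standard deviation divided by the mean gives a unit-free spread.
    return average, stdev(replicate_intensities) / average
```

A large CV for a probe flags spots whose replicates disagree within a single hybridization, before any biological comparison is attempted.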



Gene expression values from microarray experiments can be represented as heat maps to visualize the result of data analysis.



[15] 



Standardization 

Microarray data is difficult to exchange due to the lack of standardization in platform fabrication, assay protocols, 
and analysis methods. This presents an interoperability problem in bioinformatics. Various grass-roots open-source 
projects are trying to ease the exchange and analysis of data produced with non-proprietary chips: 

• For example, the "Minimum Information About a Microarray Experiment" (MIAME) checklist helps define the 
level of detail that should exist and is being adopted by many journals as a requirement for the submission of 
papers incorporating microarray results. But MIAME does not describe the format for the information, so while 
many formats can support the MIAME requirements, as of 2007 no format permits verification of complete 
semantic compliance. 

• The "MicroArray Quality Control (MAQC) Project" is being conducted by the US Food and Drug Administration
(FDA) to develop standards and quality control metrics which will eventually allow the use of microarray data in
drug discovery, clinical practice and regulatory decision-making.

• The MGED Society has developed standards for the representation of gene expression experiment results and 
relevant annotations. 
Statistical analysis

Microarray data sets are commonly very large, and analytical precision is influenced by a number of variables. 
Statistical challenges include taking into account effects of background noise and appropriate normalization of the 
data. Normalization methods may be suited to specific platforms and, in the case of commercial platforms, the 
analysis may be proprietary. Algorithms that affect statistical analysis include: 

• Image analysis: gridding, spot recognition of the scanned image (segmentation algorithm), removal or marking of 
poor-quality and low-intensity features (called flagging). 

• Data processing: background subtraction (based on global or local background), determination of spot intensities
and intensity ratios, visualisation of data (e.g., see MA plot), log-transformation of ratios, and global or local
normalization of intensity ratios.

• Identification of statistically significant changes: t-test, ANOVA, Bayesian method, or Mann-Whitney test
methods tailored to microarray data sets, which take into account multiple comparisons. These methods assess
statistical power based on the variation present in the data and the number of experimental replicates, and can
help minimize Type I and type II errors in the analyses.

• Network-based methods: Statistical methods that take the underlying structure of gene networks into account,
representing either associative or causative interactions or dependencies among gene products.
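The data-processing step in the list above (log-transformation, MA plot, global normalization) can be illustrated with a short sketch. It uses the usual MA-plot definitions, M = log2(R) - log2(G) and A = (log2(R) + log2(G)) / 2, and then centres the M values on their median as one simple form of global normalization. The function names are hypothetical, and median centring is an assumed choice among several possible normalization methods.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return (M, A) for one spot from background-corrected red and green intensities."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    log_red, log_green = math.log2(red), math.log2(green)
    return log_red - log_green, (log_red + log_green) / 2


def global_median_normalize(m_values):
    """Shift all log ratios so that their median becomes zero."""
    center = median(m_values)
    return [m - center for m in m_values]
```

Median centring rests on the assumption that most genes are not differentially expressed, so the typical log ratio on an array should be zero.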

Microarray data may require further processing aimed at reducing the dimensionality of the data to aid
comprehension and more focused analysis. Other methods permit analysis of data consisting of a low number of
biological or technical replicates; for example, the Local Pooled Error (LPE) test pools standard deviations of genes
with similar expression levels in an effort to compensate for insufficient replication.
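Taking multiple comparisons into account, as mentioned in the list of statistical methods above, usually means adjusting the per-gene p-values. The sketch below implements one widely used adjustment, the Benjamini-Hochberg procedure, which controls the false discovery rate. It is not named in the text above and is shown here only as an example. The function name is hypothetical, and the input is assumed to be a plain list of p-values, one per gene.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted
```

Genes whose adjusted value falls below a chosen level (for example 0.05) would then be reported as significant; the level itself is an analysis decision, not part of the procedure.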

Relation between probe and gene 

The relation between a probe and the mRNA that it is expected to detect is problematic. On the one hand, some
mRNAs may cross-hybridize with probes in the array that are supposed to detect another mRNA. In addition, mRNAs
may experience amplification bias that is sequence- or molecule-specific. On the other hand, probes that are designed
to detect the mRNA of a particular gene may be relying on genomic EST information that is incorrectly associated 
with that gene. 

Data warehousing 

Microarray data was found to be more useful when compared to other similar datasets. The sheer volume (in bytes),
specialized formats and reporting requirements (such as MIAME), and curation efforts associated with the datasets
require specialized databases to store the data.

See also 

• Cyanine dyes, such as Cy3 and Cy5, are commonly used fluorophores with microarrays 

• FatiGO 

• Full Genome Sequencing 

• Gene chip analysis 

• Microfluidics or lab-on-chip 

• Serial analysis of gene expression 

• Significance analysis of microarrays 

• Systems biology 
Glossary

• An Array or slide is a collection of features spatially arranged in a two-dimensional grid of columns and rows.

• Block or subarray: a group of spots, typically made in one print round; several subarrays/blocks form an array. 

• Case/control: an experimental design paradigm especially suited to the two-colour array system, in which a 
condition chosen as control (such as healthy tissue or state) is compared to an altered condition (such as a 
diseased tissue or state). 

• Channel: the fluorescence output recorded in the scanner for an individual fluorophore; the signal can even be
ultraviolet.

• Dye flip or Dye swap or Fluor reversal: reciprocal labelling of DNA targets with the two dyes to account for dye 
bias in experiments. 

• Scanner: an instrument used to detect and quantify the intensity of fluorescence of spots on a microarray slide, by 
selectively exciting fluorophores with a laser and measuring the fluorescence with a filter (optics) photomultiplier 
system. 

• Spot or feature: a small area on an array slide that contains picomoles of specific DNA samples. 

• For other relevant terms see: 

Glossary of gene expression terms 
Protocol (natural sciences) 



External links 

• Many important links can be found at the Open Directory Project

• Gene Expression at the Open Directory Project

• Micro Scale Products and Services for Biochemistry and Molecular Biology at the Open Directory Project

• Products and Services for Gene Expression at the Open Directory Project

• PLoS Biology Primer: Microarray Analysis

• Rundown of microarray technology

• ArrayMining.net - a free web-server for online microarray analysis

• CLASSIFI - Gene Ontology-based gene cluster classification resource

• Microarray - How does it work?

• What Are DNA Microarrays - A Non-Biologists Introduction to Microarrays

• Microarray data processing using Self-Organizing Maps tutorial: Part 1, Part 2

• PNAS Commentary: Discovery of Principles of Nature from Mathematical Modeling of DNA Microarray Data
Triple-stranded DNA

A triple-stranded DNA is a structure of DNA in which three oligonucleotides wind around each other and form a 
triple helix. In this structure, one strand binds to a B-form DNA double helix through Hoogsteen or reversed 
Hoogsteen hydrogen bonds. 

For example, a nucleobase T binds to the A of a Watson-Crick T-A base pair through Hoogsteen hydrogen bonds, forming
an AxT pair (x represents a Hoogsteen base pair). An N-3 protonated cytosine, represented as C+, can also form a
base-triplet with a C-G pair through the Hoogsteen base-pairing of a GxC+. Thus, the triple-helical DNAs using
these Hoogsteen pairings consist of two homopyrimidines and one homopurine, and the homopyrimidine third strand
is parallel to the homopurine strand.

A homopurine third strand can also bind to a homopurine-homopyrimidine duplex using reversed Hoogsteen 
patterns. In this triplex, a nucleobase A binds to a T-A base pair and a G to a C-G pair. Since the nucleobases on the 
third strand have to be reversed, the homopurine third strand is antiparallel to the homopurine strand of the original 
duplex. 
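The pairing rules of the two motifs described above can be turned into a small sketch that proposes a third strand for a given homopurine strand. It assumes all sequences are written 5' to 3', writes the protonated cytosine C+ simply as "C", and covers only the base triplets named in the text. The function name and the `motif` argument are hypothetical.

```python
# Third-strand base for each purine of the duplex's homopurine strand.
PYRIMIDINE_MOTIF = {"A": "T", "G": "C"}  # Hoogsteen: AxT and GxC+ ("C" stands for C+)
PURINE_MOTIF = {"A": "A", "G": "G"}      # reversed Hoogsteen: A on T-A, G on C-G


def third_strand(purine_strand, motif="pyrimidine"):
    """Return a third strand (5'->3') for a homopurine strand given 5'->3'."""
    sequence = purine_strand.upper()
    if set(sequence) - {"A", "G"}:
        raise ValueError("the target strand must contain only A and G")
    if motif == "pyrimidine":
        # Parallel to the homopurine strand: same reading direction.
        return "".join(PYRIMIDINE_MOTIF[base] for base in sequence)
    if motif == "purine":
        # Antiparallel to the homopurine strand: reverse the reading direction.
        return "".join(PURINE_MOTIF[base] for base in reversed(sequence))
    raise ValueError("motif must be 'pyrimidine' or 'purine'")
```

For the homopurine strand AAGGA, the pyrimidine motif gives TTCCT (parallel), while the purine motif gives AGGAA, which is the same bases read in the opposite direction.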

Triple-stranded DNA was a common hypothesis in the 1950s when scientists were struggling to discover DNA's true
structural form. Watson and Crick (who later won the Nobel Prize for their double-helix model) originally
considered a triple-helix model, as did Pauling and Corey, who published a proposal for their triple-helix model in
the journal Nature in 1953, as well as fellow scientist Fraser. However, Watson and Crick soon identified several
problems with these models: 1) Negatively charged phosphates near the axis would repel each other, leaving the
question as to how the three-chain structure would stay together. 2) In a triple-helix model (specifically Pauling and
Corey's model), some of the van der Waals distances appear to be too small. Fraser's model differed from Pauling
and Corey's in that in his model the phosphates are on the outside and the bases are on the inside, linked together by
hydrogen bonds. However, Watson and Crick found Fraser's model to be too ill-defined to comment specifically on
its inadequacies in their 1953 publication in Nature, "Molecular Structure of Nucleic Acids".

Triple-stranded DNA was also described in 1957, when it was thought to occur in only one in vivo biological 
process: as an intermediate product during the action of the E. coli recombination enzyme RecA. Its role in that 
process is not understood. 

Using nucleic acid segments that bind to the DNA duplexes to form triple strands as a way of regulating gene 
expression is under investigation by biotechnology companies. Similar work is also being undertaken at Yale 
University. 
G-quadruplex

Nucleic acid sequences which are rich in guanine are capable of forming four-stranded structures called
G-quadruplexes (also known as G-tetrads or G4-DNA). These consist of a square arrangement of guanines (a tetrad),
stabilized by Hoogsteen hydrogen bonding. They are further stabilized by the existence of a monovalent cation 
(especially potassium) in the center of the tetrads. They can be formed of DNA, RNA, LNA and PNA, and may be 
intramolecular, bimolecular or tetramolecular. Depending on the direction of the strands or parts of a strand that form 
the tetrads, structures may be described as parallel or antiparallel. 



Telomeric quadruplexes 

Telomeric repeats in a variety of organisms have been shown to form these structures in vitro, and they have also
been shown to form in vivo in some cases. The human telomeric repeat (which is the same for all vertebrates)
consists of many repeats of the sequence d(GGTTAG), and the quadruplexes formed by this structure have been well
studied by NMR and X-ray crystal structure determination. The formation of these quadruplexes in telomeres has
been shown to decrease the activity of the enzyme telomerase, which is responsible for maintaining length of
telomeres and is involved in around 85% of all cancers. This is an active target of drug discovery.



Structure of a G-quadruplex. Left: a G-tetrad. Right: an intramolecular G-quadruplex
Non-telomeric quadruplexes

Recently, there has been increasing interest in quadruplexes in locations other than at the telomere. This was
given a large boost by the work by Simonsson [1] and Hurley [2] on the proto-oncogene c-myc, which was shown to
form a quadruplex in a nuclease hypersensitive region critical for gene activity. Since then, many other genes have
been shown to have G-quadruplexes in their promoter regions, including the chicken β-globin gene, human
ubiquitin-ligase RFP2 and the proto-oncogenes c-kit, bcl-2, VEGF, H-ras and N-ras. This list is ever-increasing.

3D Structure of the intramolecular human telomeric G-quadruplex in potassium solution (PDB ID 2HY9). The backbone
is represented by a tube. The center of this structure contains three layers of G-tetrads. The hydrogen bonds in
these layers are represented by blue dashed lines.



Genome-wide surveys based on a quadruplex folding rule have been performed, which have identified 376,000
Putative Quadruplex Sequences (PQS) in the human genome, although probably not all of these form in vivo [3]. A
similar study has identified putative G-quadruplexes in prokaryotes [4]. There are several possible models for how
quadruplexes could control gene activity, either by upregulation or downregulation. One model is shown below, with
G-quadruplex formation in or near a promoter blocking transcription of the gene, and hence de-activating it. In
another model, a quadruplex formed at the non-coding DNA strand helps to maintain an open conformation of the
coding DNA strand and enhances expression of the respective gene.
Ligands which bind quadruplexes

One way of inducing or stabilizing G-quadruplex formation is to introduce a molecule which can bind to the
G-quadruplex structure, and a number of ligands, both small molecules and proteins, have been developed which can
do so. This has become an increasingly large field of research.

A number of naturally occurring proteins have been identified which selectively bind to G-quadruplexes. These
include the helicases implicated in Bloom's and Werner's syndromes and the Saccharomyces cerevisiae protein RAP1.
An artificially derived three zinc finger protein called Gq1, which is specific for G-quadruplexes, has also been
developed, as have specific antibodies.

Cationic porphyrins have been shown to bind intercalatively with G-quadruplexes, as has the molecule
telomestatin.



Model for quadruplex-mediated down-regulation of gene expression
Quadruplex prediction techniques

Identifying and predicting sequences which have the capacity to form quadruplexes is an important tool in further
understanding of their role. A rule for predicting the formation has been proposed, where sequences are predicted to
fold based on the pattern d(G_{3+}N_{1-7}G_{3+}N_{1-7}G_{3+}N_{1-7}G_{3+}), where N is any base (including guanine),
that is, four runs of at least three guanines separated by loops of one to seven bases. This rule has
been widely used in on-line algorithms.
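The folding rule translates almost directly into a regular expression. The sketch below is an illustration of that rule, not a reimplementation of any of the published tools. It assumes an unambiguous DNA alphabet (A, C, G, T), scans only the strand it is given, and reports non-overlapping matches; the function name is hypothetical.

```python
import re

# Four runs of at least three G, separated by three loops of 1-7 bases each.
PQS_PATTERN = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")


def find_putative_quadruplexes(sequence):
    """Return (start, end, matched_text) for each non-overlapping match."""
    normalized = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in PQS_PATTERN.finditer(normalized)]
```

To cover the complementary strand as well, the same search can be run on the reverse complement of the input. A match only marks a putative quadruplex sequence; as noted above, not every such sequence is expected to fold in vivo.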



External links 

Books 

• Quadruplex Nucleic Acids (ISBN 0-85404-374-8) Neidle & Balasubramanian (Eds.) 2006 



Quadruplex websites 

• Quadruplex.org - a website to serve the quadruplex community

• Quadbase - downloadable data on predicted G-quadruplexes

• Greglist - a database listing potential G-quadruplex regulated genes

• Database on Quadruplex information: QuadBase from IGIB

• GRSDB [10] - a database of G-quadruplexes near RNA processing sites.

• GRS_UTRdb [11] - a database of G-quadruplexes in the UTRs.

• G-quadruplex Resource Site 



Research papers 

• In vivo Veritas: Using yeast to probe the biological functions of G-quadruplexes. Johnson JE, Smith JS, Kozak 
ML, Johnson FB. Biochimie. 2008 Feb 21 [13] 

• Prevalence of quadruplexes in the human genome, Huppert and Balasubramanian, NAR 2005 33(9) 2908-2916 

• Highly prevalent putative quadruplex sequence motifs in human DNA, Todd, Johnston, and Neidle, NAR 2005 
33(9) 2901-2907 [14] 

• Quadruplex DNA: sequence, topology and structure, Burge, Parkinson, Hazel, Todd and Neidle, NAR 2006 
34(19) 5402-5415 [15] 

• Direct evidence for a G-quadruplex in a promoter region and its targeting with a small molecule to repress 
c-MYC transcription, Siddiqui-Jain et al, PNAS 2002 99(18) 11593-8 [2] 

• Genome-wide prediction of G4 DNA as regulatory motifs: Role in Escherichia coli global regulation, Rawal P,
Kummarasetti VB, Ravindran J, Kumar N, Haider K, Sharma R, Mukerji M, Das SK, Chowdhury S., Genome 
Res. 2006 16(5) 644-55 [4] 

• A Biomimetic Potassium Responsive Nanochannel: G-Quadruplex DNA Conformational Switching in a 
Synthetic Nanopore, Xu Hou, Wei Guo, Fan Xia, Fu-Qiang Nie, Hua Dong, Ye Tian, Liping Wen, Lin Wang, 
Liuxuan Cao, Yang Yang, Jianming Xue, Yanlin Song, Yugang Wang, Dongsheng Liu and Lei Jiang, J. Am. 
Chem. Soc, 2009 131(22) 7800-7805 [16] 
Tools to predict G-quadruplex motifs

• Quadparser: Downloadable program for finding putative quadruplex-forming sequences [17]

• QGRS Mapper: a web-based application for predicting G-quadruplexes in nucleotide sequences and NCBI genes
[18]

• Quadfinder: Tool for Prediction and Analysis of G Quadruplex Motifs in DNA/RNA Sequences from IGIB [19]
DNA Analysis

Genetic Testing: Gene tests (also called DNA-based tests), the newest and most sophisticated of the techniques
used to test for genetic disorders, involve direct examination of the DNA molecule itself. Other genetic tests include
biochemical tests for such gene products as enzymes and other proteins and microscopic examination of stained
or fluorescent chromosomes. Genetic tests are used for several reasons, including:

• carrier screening, which involves identifying unaffected individuals who carry one copy of a gene for a disease 
that requires two copies for the disease to be expressed 

• preimplantation genetic diagnosis (see the side bar, Screening Embryos for Disease) 

• prenatal diagnostic testing 

• newborn screening 

• presymptomatic testing for predicting adult-onset disorders such as Huntington's disease 

• presymptomatic testing for estimating the risk of developing adult-onset cancers and Alzheimer's disease 

• confirmational diagnosis of a symptomatic individual 

• forensic/identity testing 

Genetic testing allows the genetic diagnosis of vulnerabilities to inherited diseases, and can also be used to determine a
child's paternity (genetic father) or a person's ancestry. Normally, every person carries two copies of every gene, one
inherited from their mother, one inherited from their father. The human genome is believed to contain around
20,000-25,000 genes. In addition to studying chromosomes to the level of individual genes, genetic testing in a broader
sense includes biochemical tests for the possible presence of genetic diseases, or mutant forms of genes associated
with increased risk of developing genetic disorders. Genetic testing identifies changes in chromosomes, genes, or
proteins. Most of the time, testing is used to find changes that are associated with inherited disorders. The results
of a genetic test can confirm or rule out a suspected genetic condition or help determine a person's chance of
developing or passing on a genetic disorder. Several hundred genetic tests are currently in use, and more are being
developed.

Since genetic testing may open up ethical or psychological problems, genetic testing is often accompanied by genetic 
counseling. 

Types 

Genetic testing is "the analysis of chromosomes (DNA), proteins, and certain metabolites in order to detect heritable
disease-related genotypes, mutations, phenotypes, or karyotypes for clinical purposes." It can provide information
about a person's genes and chromosomes throughout life. Available types of testing include:

• Newborn screening: Newborn screening is used just after birth to identify genetic disorders that can be treated 
early in life. The routine testing of infants for certain disorders is the most widespread use of genetic 
testing — millions of babies are tested each year in the United States. All states currently test infants for 
phenylketonuria (a genetic disorder that causes intellectual disability if left untreated) and congenital hypothyroidism (a
disorder of the thyroid gland). 

• Diagnostic testing: Diagnostic testing is used to diagnose or rule out a specific genetic or chromosomal condition. 
In many cases, genetic testing is used to confirm a diagnosis when a particular condition is suspected based on 
physical signs and symptoms. Diagnostic testing can be performed at any time during a person's life, but is
not available for all genes or all genetic conditions. The results of a diagnostic test can influence a person's 
choices about health care and the management of the disease. 

• Carrier testing: Carrier testing is used to identify people who carry one copy of a gene mutation that, when present 
in two copies, causes a genetic disorder. This type of testing is offered to individuals who have a family history of 
a genetic disorder and to people in ethnic groups with an increased risk of specific genetic conditions. If both 
parents are tested, the test can provide information about a couple's risk of having a child with a genetic condition.

• Prenatal testing: Prenatal testing is used to detect changes in a fetus's genes or chromosomes before birth. This 
type of testing is offered to couples with an increased risk of having a baby with a genetic or chromosomal 
disorder. In some cases, prenatal testing can lessen a couple's uncertainty or help them decide whether to abort the 
pregnancy. It cannot identify all possible inherited disorders and birth defects, however. 

• Preimplantation genetic diagnosis: Genetic testing procedures that are performed on human embryos prior to the 
implantation as part of an in vitro fertilization procedure. 

• Predictive and presymptomatic testing: Predictive and presymptomatic types of testing are used to detect gene 
mutations associated with disorders that appear after birth, often later in life. These tests can be helpful to people 
who have a family member with a genetic disorder, but who have no features of the disorder themselves at the 
time of testing. Predictive testing can identify mutations that increase a person's chances of developing disorders 
with a genetic basis, such as certain types of cancer. For example, an individual with a mutation in BRCA1 has a 
65% cumulative risk of breast cancer [5]. Presymptomatic testing can determine whether a person will develop a 
genetic disorder, such as hemochromatosis (an iron overload disorder), before any signs or symptoms appear. The 
results of predictive and presymptomatic testing can provide information about a person's risk of developing a 
specific disorder and help with making decisions about medical care. 

• Forensic testing: Forensic testing uses DNA sequences to identify an individual for legal purposes. Unlike the 
tests described above, forensic testing is not used to detect gene mutations associated with disease. This type of 
testing can identify crime or catastrophe victims, rule out or implicate a crime suspect, or establish biological 
relationships between people (for example, paternity). 

• Parental testing: This type of genetic test uses special DNA markers to identify the same or similar inheritance 
patterns between related individuals. Based on the fact that we all inherit half of our DNA from the father, and 
half from the mother, DNA scientists test individuals to find the match of DNA sequences at some highly 
differential markers to draw the conclusion of relatedness. 

• Research testing: Research testing includes finding unknown genes, learning how genes work and advancing our 
understanding of genetic conditions. The results of testing done as part of a research study are usually not 
available to patients or their healthcare providers. 
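The risk information that carrier testing provides, as described in the list above, follows from simple Mendelian arithmetic. The sketch below enumerates the four equally likely allele combinations for one gene. It assumes a single autosomal gene with a recessive disease allele written "a" and a normal allele written "A". The function name and the two-letter genotype notation are hypothetical conventions for this example.

```python
from fractions import Fraction
from itertools import product


def offspring_genotype_probabilities(parent1, parent2):
    """Map each possible child genotype to its probability.

    Each parent is a two-character genotype such as "Aa"; every child
    inherits one allele from each parent with equal probability.
    """
    probabilities = {}
    for allele1, allele2 in product(parent1, parent2):
        genotype = "".join(sorted((allele1, allele2)))
        probabilities[genotype] = probabilities.get(genotype, Fraction(0)) + Fraction(1, 4)
    return probabilities
```

When both parents are carriers ("Aa"), the child has a 1 in 4 chance of inheriting two copies of the disease allele, a 1 in 2 chance of being a carrier, and a 1 in 4 chance of inheriting neither copy. If only one parent is a carrier, no child is expected to be affected under these assumptions.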

Medical procedure 

Genetic testing is often done as part of a genetic consultation and as of mid-2008 there were more than 1,200
clinically applicable genetic tests available [6]. Once a person decides to proceed with genetic testing, a medical
geneticist, genetic counselor, primary care doctor, or specialist can order the test after obtaining informed consent. 

Genetic tests are performed on a sample of blood, hair, skin, amniotic fluid (the fluid that surrounds a fetus during 
pregnancy), or other tissue. For example, a medical procedure called a buccal smear uses a small brush or cotton 
swab to collect a sample of cells from the inside surface of the cheek. Alternatively, a small amount of saline 
mouthwash may be swished in the mouth to collect the cells. The sample is sent to a laboratory where technicians 
look for specific changes in chromosomes, DNA, or proteins, depending on the suspected disorder. The laboratory 
reports the test results in writing to a person's doctor or genetic counselor. 

Routine newborn screening tests are done on a small blood sample obtained by pricking the baby's heel with a lancet. 
Interpreting results

The results of genetic tests are not always straightforward, which often makes them challenging to interpret and 
explain. When interpreting test results, healthcare professionals consider a person's medical history, family history, 
and the type of genetic test that was done. 

A positive test result means that the laboratory found a change in a particular gene, chromosome, or protein of 
interest. Depending on the purpose of the test, this result may confirm a diagnosis, indicate that a person is a carrier 
of a particular genetic mutation, identify an increased risk of developing a disease (such as cancer) in the future, or 
suggest a need for further testing. Because family members have some genetic material in common, a positive test 
result may also have implications for certain blood relatives of the person undergoing testing. It is important to note 
that a positive result of a predictive or presymptomatic genetic test usually cannot establish the exact risk of 
developing a disorder. Also, health professionals typically cannot use a positive test result to predict the course or 
severity of a condition. 

A negative test result means that the laboratory did not find a change in the gene, chromosome, or protein
under consideration. This result can indicate that a person is not affected by a particular disorder, is not a carrier of a 
specific genetic mutation, or does not have an increased risk of developing a certain disease. It is possible, however, 
that the test missed a disease-causing genetic alteration because many tests cannot detect all genetic changes that can 
cause a particular disorder. Further testing may be required to confirm a negative result. 

In some cases, a test result might not give any useful information. This type of result is called uninformative,
indeterminate, inconclusive, or ambiguous. Uninformative test results sometimes occur because everyone has 
common, natural variations in their DNA, called polymorphisms, that do not affect health. If a genetic test finds a 
change in DNA that has not been associated with a disorder in other people, it can be difficult to tell whether it is a 
natural polymorphism or a disease-causing mutation. An uninformative result cannot confirm or rule out a specific 
diagnosis, and it cannot indicate whether a person has an increased risk of developing a disorder. In some cases, 
testing other affected and unaffected family members can help clarify this type of result.

Risks and limitations 

The physical risks associated with most genetic tests are very small, particularly for those tests that require only a 
blood sample or buccal smear (a procedure that samples cells from the inside surface of the cheek). The procedures 
used for prenatal testing carry a small but real risk of losing the pregnancy (miscarriage) because they require a 
sample of amniotic fluid or tissue from around the fetus. 

Many of the risks associated with genetic testing involve the emotional, social, or financial consequences of the test 
results. People may feel angry, depressed, anxious, or guilty about their results. In some cases, genetic testing creates 
tension within a family because the results can reveal information about other family members in addition to the 
person who is tested. The possibility of genetic discrimination in employment or insurance is also a concern. Some 
individuals avoid genetic testing out of fear it will affect their ability to purchase insurance or find a job. Health
insurers do not currently require applicants for coverage to undergo genetic testing, and when insurers encounter 
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genetic information, it is subject to the same confidentiality protections as any other sensitive health information. 
In the United States, the use of genetic information is governed by the Genetic Information Nondiscrimination Act 
(GINA) (see discussion below in "U.S. Government Regulation Related to Genetic Testing and Information"). 

Genetic testing can provide only limited information about an inherited condition. The test often can't determine if a 
person will show symptoms of a disorder, how severe the symptoms will be, or whether the disorder will progress 
over time. Another major limitation is the lack of treatment strategies for many genetic disorders once they are 
diagnosed. 

A genetics professional can explain in detail the benefits, risks, and limitations of a particular test. It is important that 
any person who is considering genetic testing understand and weigh these factors before making a decision. 

Direct-to-Consumer (DTC) Genetic Testing 

Direct-to-Consumer (DTC) genetic testing is a type of genetic test that is accessible directly to the consumer without 
having to go through a health care professional. Usually, to obtain a genetic test, health care professionals such as 
doctors acquire the permission of the patient and order the desired test. DTC genetic tests, however, allow consumers 
to bypass this process and order one themselves. There are a variety of DTC tests, ranging from testing for breast 
cancer alleles to mutations linked to cystic fibrosis. Benefits of DTC testing are the accessibility of tests to 
consumers, promotion of proactive healthcare and the privacy of genetic information. Possible additional risks of 
DTC testing are the lack of governmental regulation and the potential misinterpretation of genetic information. 

Controversy 

DTC genetic testing has drawn outspoken opposition from within the scientific community. Critics of DTC testing 
point to the risks involved, the unregulated advertising and marketing claims, and the overall lack of governmental 
oversight. 

DTC testing involves many of the same risks associated with any genetic test. One of the most obvious and 
dangerous of these is that test results may be seriously misread. Without professional guidance, consumers may 
misinterpret genetic information and draw false conclusions about their personal health. 

Some advertising for direct-to-consumer genetic testing has been criticized as conveying an exaggerated and 
inaccurate message about the connection between genetic information and disease risk, utilizing emotions as a 
selling factor. An advertisement for a BRCA-predictive genetic test for breast cancer stated: "There is no stronger 
antidote for fear than information." 



U.S. Government Regulation Related to Genetic Testing and Information 

Currently, the U.S. has no strong federal regulation moderating the DTC market. Though there are several hundred 
tests available, only a handful are approved by the Food and Drug Administration (FDA); these are sold as at-home 
test kits, and are therefore considered "medical devices" over which the FDA may assert jurisdiction. Other types of 
DTC tests require customers to mail in DNA samples for testing; it is difficult for the FDA to exercise jurisdiction 
over these types of tests, because the actual testing is completed in the laboratories of providers. As of 2007, the 
FDA had not yet officially substantiated with scientific evidence the claimed accuracy of the majority of 
direct-to-consumer genetic tests. 

With regard to genetic testing and information in general, legislation in the United States called the Genetic 
Information Nondiscrimination Act prohibits group health plans and health insurers from denying coverage to a 
healthy individual or charging that person higher premiums based solely on a genetic predisposition to developing a 
disease in the future. The legislation also bars employers from using individuals' genetic information when making 
hiring, firing, job placement, or promotion decisions. [12] The legislation, the first of its kind in the U.S., [13] was 
passed by the United States Senate on April 24, 2008, on a vote of 95-0, and was signed into law by President 
George W. Bush on May 21, 2008. [14] [15] It went into effect on November 21, 2009. 

Fiction 

Some possible future ethical problems of genetic testing were considered in the science fiction film Gattaca and the 
science fiction anime series "Gundam Seed". Genetic testing also features in the film "The Island" and in the 
"Resident Evil" series. 

See also 

• Full Genome Sequencing 

• List of human genes 

• List of genetic disorders 

• Gene theft 

External links 

• GeneTests, a US National Institutes of Health funded resource on genetic testing. 

• EuroGentest, a European network for test development, harmonization, validation and standardization. 

• TECHGENE [18], a European project on genetic testing and next generation sequencing technology. 

• "The science and practice of genetic testing for Huntington's disease" [19]. Huntington's Disease Outreach Project 
for Education at Stanford. 

• Downloadable article: "Evidence that a West-East admixed population lived in the Tarim Basin as early as the 
early Bronze Age". Li et al. BMC Biology 2010, 8:15. [20] 

• KnowYourGenes.org [21], Genetic Disease Foundation. 



Human genome 



The human genome is the genome of Homo 
sapiens, which is stored on 23 chromosome 
pairs. Twenty-two of these are autosomal 
chromosome pairs, while the remaining pair 
is sex-determining. The haploid human 
genome occupies a total of just over 3 
billion DNA base pairs. The Human 
Genome Project (HGP) produced a 
reference sequence of the euchromatic 
human genome, which is used worldwide in 
biomedical sciences. 

The haploid human genome contains ca. 
23,000 protein-coding genes, far fewer than 
had been expected before its sequencing. [2] 
In fact, only about 1.5% of the genome 
codes for proteins, while the rest consists of 
non-coding RNA genes, regulatory 
sequences, introns, and (controversially 
named) "junk" DNA. [3] 

Features 

Genes 

There are estimated to be between 20,000 and 25,000 human protein-coding genes. The estimate of the number of 
human genes has been repeatedly revised down from initial predictions of 100,000 or more as genome sequence 
quality and gene finding methods have improved. 

Surprisingly, the number of human genes seems to be less than a factor of two greater than that of many much 
simpler organisms, such as the roundworm and the fruit fly. However, human cells make extensive use of alternative 
splicing to produce several different proteins from a single gene, and the human proteome is thought to be much 
larger than those of the aforementioned organisms. Besides, most human genes have multiple exons, and human 
introns are frequently much longer than the flanking exons. 
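
As a rough illustration of how alternative splicing lets one gene yield several products, the sketch below enumerates 
transcripts under a deliberately simplified exon-skipping model. The function name and exon labels are hypothetical, 
and the model assumes that the first and last exons are always kept and that any internal exon may be skipped; real 
splicing is regulated and does not produce every combination. 

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """Enumerate transcripts in a toy exon-skipping model.

    Assumption (not from the article): the first and last exons are always
    retained, exon order is preserved, and every subset of internal exons
    is allowed. Requires at least two exons.
    """
    first, *internal, last = exons
    isoforms = []
    for kept_count in range(len(internal) + 1):
        for kept in combinations(internal, kept_count):
            isoforms.append([first, *kept, last])
    return isoforms


# A hypothetical four-exon gene: two internal exons give 2**2 combinations.
isoforms = exon_skipping_isoforms(["E1", "E2", "E3", "E4"])
for isoform in isoforms:
    print("-".join(isoform))
```

Under this model a gene with n internal exons has at most 2**n such transcripts, which gives a sense of why the 
proteome can be much larger than the gene count. 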

Human genes are distributed unevenly across the chromosomes. Each chromosome contains various gene-rich and 
gene-poor regions, which seem to be correlated with chromosome bands and GC-content. The significance of these 
nonrandom patterns of gene density is not well understood. In addition to protein-coding genes, the human genome 
contains thousands of RNA genes, including tRNA, ribosomal RNA, microRNA, and other non-coding RNA genes. 

Regulatory sequences 

The human genome has many different regulatory sequences which are crucial to controlling gene expression. These 
are typically short sequences that appear near or within genes. A systematic understanding of these regulatory 
sequences and how they together act as a gene regulatory network is only beginning to emerge from computational, 
high-throughput expression and comparative genomics studies. Some types of non-coding DNA are genetic 
"switches" that do not encode proteins, but do regulate when and where genes are expressed. 

Identification of regulatory sequences relies in part on evolutionary conservation. The evolutionary branch between 
the primates and mouse, for example, occurred 70-90 million years ago. [5] Non-coding sequences that have stayed 
nearly unchanged over such a span are unlikely to have done so by chance, so computer comparisons that identify 
conserved non-coding sequences give an indication of their importance in functions such as gene regulation. [6] 
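
The idea of scanning for conserved stretches can be sketched as follows. This is a minimal illustration, not the 
method used in the cited studies: it assumes the two sequences have already been aligned to equal length, the 
function name, window size and threshold are hypothetical, and the two short sequences are made up for the example. 

```python
def conserved_windows(seq_a, seq_b, window=10, min_identity=0.9):
    """Return (start, identity) for each window in which two pre-aligned,
    equal-length sequences match at a fraction >= min_identity of positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for start in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        identity = sum(a == b for a, b in pairs) / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Made-up sequences sharing an identical 14-base core (positions 4 to 17).
human = "ACGTACGTAGCTAGCTAGGATCCA"
mouse = "TTGAACGTAGCTAGCTAGCCTAAG"
hits = conserved_windows(human, mouse, window=10, min_identity=1.0)
print(hits)
```

Real comparative analyses work on whole-genome alignments and score conservation against the expected neutral rate 
of change; the fixed-window identity used here is only the simplest possible stand-in. 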

Another comparative genomic approach to locating regulatory sequences in humans is sequencing the genome of the 
puffer fish. These vertebrates have essentially the same genes and regulatory sequences as humans, but with 
only one-eighth the "junk" DNA. The compact DNA sequence of the puffer fish makes it much easier to locate the 
regulatory sequences. 

Other DNA 

Protein-coding sequences (specifically, coding exons) comprise less than 1.5% of the human genome. Aside from 
genes and known regulatory sequences, the human genome contains vast regions of DNA the function of which, if 
any, remains unknown. These regions in fact comprise the vast majority, by some estimates 97%, of the human 
genome size. Much of this is composed of: 

Repeat elements 

• Tandem repeats 

    • Satellite DNA 

    • Minisatellite 

    • Microsatellite 

• Interspersed repeats 

    • SINEs 

    • LINEs 

Transposons 

• Retrotransposons 

    • LTR 

        • Ty1-copia 

        • Ty3-gypsy 

    • Non-LTR 

        • SINEs 

        • LINEs 

• DNA Transposons 



Junk DNA 

There is also a large amount of sequence that does not fall under any known classification. Much of this 
sequence may be an evolutionary artifact that serves no present-day purpose, and these regions are sometimes 
collectively referred to as "junk" DNA. There are, however, a variety of emerging indications that many sequences 
within are likely to function in ways that are not fully understood. Recent experiments using microarrays have 
revealed that a substantial fraction of non-genic DNA is in fact transcribed into RNA, which leads to the 
possibility that the resulting transcripts may have some unknown function. Also, the evolutionary conservation 
across the mammalian genomes of much more sequence than can be explained by protein-coding regions indicates 
that many, and perhaps most, functional elements in the genome remain unknown. The investigation of the vast 
quantity of sequence information in the human genome whose function remains unknown is currently a major 
avenue of scientific inquiry. [10] 

Information content 

The 3 billion base pairs of the haploid human genome correspond to an information content of about 750 megabytes, 
since every base pair can be coded by 2 bits. The entropy rate of the genome differs significantly between coding and 
non-coding sequences. It is close to the maximum of 2 bits per base pair for the coding sequences (about 45 million 
base pairs), and between 1.5 and 1.9 bits per base pair for each individual chromosome, except for the Y 
chromosome, which has an entropy rate below 0.9 bits per base pair. [11] 
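
The 750 megabyte figure follows directly from the 2-bit encoding, and the notion of an entropy below 2 bits per base 
can be illustrated with a small calculation. The sketch below is an illustration only: the function names are 
hypothetical, a megabyte is taken as 10**6 bytes, and the entropy shown is the simple single-base (zero-order) 
Shannon entropy of made-up sequences, which is a cruder measure than the entropy rates quoted above. 

```python
import math
from collections import Counter

BITS_PER_BASE = 2  # four possible bases (A, C, G, T) fit in 2 bits


def genome_size_megabytes(base_pairs):
    """Size in megabytes (10**6 bytes) of a 2-bit-per-base encoding."""
    return base_pairs * BITS_PER_BASE / 8 / 1_000_000


def zero_order_entropy(sequence):
    """Shannon entropy in bits per base, from single-base frequencies only."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


size_mb = genome_size_megabytes(3_000_000_000)

# Equal base frequencies reach the 2-bit maximum; skewed frequencies fall below it.
uniform = zero_order_entropy("ACGT" * 250)
skewed = zero_order_entropy("A" * 700 + "C" * 100 + "G" * 100 + "T" * 100)

print(size_mb, uniform, round(skewed, 3))
```

Repetitive sequence lowers the entropy further once longer contexts are taken into account, which is consistent with 
the low value reported for the Y chromosome. 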

[Table: information content of the haploid human genome by chromosome] 



Sequencing 

DNA sequencing determines the order of the nucleotide bases in a genome. 



Composite 

The Human Genome Project and a parallel project by Celera Genomics each produced and published a haploid 
human genome sequence, both of which were a composite of the DNA sequence of several individuals. [13] 

Personal 

A personal genome sequence is a complete sequencing of the chemical base pairs that make up the DNA of a single 
person. Because genetic variations such as single-nucleotide polymorphisms (SNPs) cause medical treatments to 
affect people differently, the analysis of personal genomes may lead to personalized medical treatment based on 
individual genotypes. 

The completion of the fifth such map was announced in December 2008. The genome mapped was that of the Korean 
researcher Seong-Jin Kim. Genome maps had previously been completed for Craig Venter of the U.S. in 2007, 
Dan Stoicescu in January 2008, James Watson of the U.S. in April 2008, and Yang Huanming of China in 
November 2008. [14][15][16] 

Personal genomes were not sequenced in the Human Genome Project, in order to protect the identity of the volunteers 
who provided DNA samples. That sequence was derived from the DNA of several volunteers from a diverse 
population. [17] Another distinction is that the HGP sequence is haploid, whereas the sequence maps for Venter and 
Watson, for example, are diploid, representing both sets of chromosomes. 

Kim's genome had 1.58 million SNPs that had never been reported before, a finding taken to indicate that six out of 
10,000 DNA bases are unique to Koreans. Kim's sequence map can be used to assist in building a standard Korean 
genome, which can then be used to compare the genomes of other Korean individuals for personalized medical treatments. 



Mapping 

Whereas a genome sequence lists the order of every DNA base in a genome, a genome map identifies the landmarks. 
A genome map is less detailed than a genome sequence and aids in navigating around the genome. [18] 



Variation 

An example of a variation map is the HapMap being developed by the International HapMap Project. The HapMap 
is a haplotype map of the human genome, "which will describe the common patterns of human DNA sequence 
variation." It catalogs the patterns of small-scale variations in the genome that involve single DNA letters, or 
bases. 

Researchers published the first sequence-based map of large-scale structural variation across the human genome in 
the journal Nature in May 2008. Large-scale structural variations are differences in the genome among people 
that range from a few thousand to a few million DNA bases; some are gains or losses of stretches of genome 
sequence and others appear as re-arrangements of stretches of sequence. These variations include differences in the 
number of copies individuals have of a particular gene, deletions, translocations and inversions. 



Variation 

Most studies of human genetic variation have focused on single-nucleotide polymorphisms (SNPs), which are 
substitutions in individual bases along a chromosome. Most analyses estimate that SNPs occur on average 
once every 100 to 300 base pairs in the euchromatic human genome, although they do not 
occur at a uniform density. Thus follows the popular statement that "we are all, regardless of race, genetically 99.9% 
the same", [23] although this would be somewhat qualified by most geneticists. For example, a much larger fraction of 
the genome is now thought to be involved in copy number variation. [24] A large-scale collaborative effort to catalog 
SNP variations in the human genome is being undertaken by the International HapMap Project. 
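
To make the "99.9% the same" figure concrete, the sketch below counts single-base differences between two aligned 
sequences and converts the popular percentage into an approximate number of differing bases. The function names and 
the two short sequences are hypothetical, and the 3 billion base pair genome size is the round figure used elsewhere 
in this article. 

```python
def count_differences(seq_a, seq_b):
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions carrying the same base."""
    matches = len(seq_a) - count_differences(seq_a, seq_b)
    return 100.0 * matches / len(seq_a)


# Made-up ten-base example with a single substitution at the last position.
reference = "ACGTACGTAC"
sample = "ACGTACGTAT"
print(count_differences(reference, sample), percent_identity(reference, sample))

# 99.9% identity means about one differing base per 1,000 bases.
GENOME_BP = 3_000_000_000
differing_bases = GENOME_BP // 1000
print(differing_bases)
```

On these assumptions two haploid genomes that are 99.9% identical would still differ at roughly three million 
positions, which is why such a small percentage leaves room for substantial individual variation. 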

The genomic loci and length of certain types of small repetitive sequences are highly variable from person to person, 
which is the basis of DNA fingerprinting and DNA paternity testing technologies. The heterochromatic portions of 
the human genome, which total several hundred million base pairs, are also thought to be quite variable within the 
human population (they are so repetitive and so long that they cannot be accurately sequenced with current 
technology). These regions contain few genes, and it is unclear whether any significant phenotypic effect results 
from typical variation in repeats or heterochromatin. 
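
The length variation that DNA fingerprinting relies on can be illustrated by counting how many times a short unit is 
repeated back to back at a locus. This is a toy sketch: the function name, the repeat unit "GATA", the flanking 
sequences and the repeat counts are all made up for illustration, and real profiling measures fragment lengths at 
standardized loci rather than searching raw strings. 

```python
import re


def longest_repeat_run(sequence, unit):
    """Largest number of consecutive copies of `unit` found in `sequence`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


# Two hypothetical individuals carrying different repeat counts at the same locus.
person_a = "TTCA" + "GATA" * 7 + "CCGT"
person_b = "TTCA" + "GATA" * 11 + "CCGT"

print(longest_repeat_run(person_a, "GATA"))
print(longest_repeat_run(person_b, "GATA"))
```

Comparing repeat counts across several independent loci is what makes a profile distinctive, since a match at one 
locus alone is common. 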

Most gross genomic mutations in gamete germ cells probably result in inviable embryos; however, a number of 
human diseases are related to large-scale genomic abnormalities. Down syndrome, Turner syndrome, and a number 
of other diseases result from nondisjunction of entire chromosomes. Cancer cells frequently have aneuploidy of 
chromosomes and chromosome arms, although a cause and effect relationship between aneuploidy and cancer has 
not been established. 

Genetic disorders 

Most aspects of human biology involve both genetic (inherited) and non-genetic (environmental) factors. Some 
inherited variation influences aspects of our biology that are not medical in nature (height, eye color, ability to taste 
or smell certain compounds, etc.). Moreover, some genetic disorders only cause disease in combination with the 
appropriate environmental factors (such as diet). With these caveats, genetic disorders may be described as clinically 
defined diseases caused by genomic DNA sequence variation. In the most straightforward cases, the disorder can be 
associated with variation in a single gene. For example, cystic fibrosis is caused by mutations in the CFTR gene, and 
is the most common recessive disorder in Caucasian populations, with over 1,300 different mutations known. 
Disease-causing mutations in specific genes are usually severe in terms of gene function and are fortunately rare, 
so genetic disorders are individually rare as well. However, since there are many genes that can vary to cause 
genetic disorders, in aggregate they comprise a significant component of known medical conditions, especially in 
pediatric medicine. Molecularly characterized genetic disorders are those for which the underlying causal gene has 
been identified; currently there are approximately 2,200 such disorders annotated in the OMIM database. 

Studies of genetic disorders are often performed by means of family-based studies. In some instances 
population-based approaches are employed, particularly in the case of so-called founder populations such as those in 
Finland, French Canada, Utah, Sardinia, etc. Diagnosis and treatment of genetic disorders are usually performed by a 
geneticist-physician trained in clinical/medical genetics. The results of the Human Genome Project are likely to 
provide increased availability of genetic testing for gene-related disorders, and eventually improved treatment. 
Parents can be screened for hereditary conditions and counselled on the consequences, the probability that a 
condition will be inherited, and how to avoid or ameliorate it in their offspring. 

As noted above, there are many different kinds of DNA sequence variation, ranging from complete extra or missing 
chromosomes down to single nucleotide changes. It is generally presumed that much naturally occurring genetic 
variation in human populations is phenotypically neutral, i.e. has little or no detectable effect on the physiology of 
the individual (although there may be fractional differences in fitness defined over evolutionary time frames). 
Genetic disorders can be caused by any or all known types of sequence variation. To molecularly characterize a new 
genetic disorder, it is necessary to establish a causal link between a particular genomic sequence variant and the 
clinical disease under investigation. Such studies constitute the realm of human molecular genetics. 

With the advent of the Human Genome Project and the International HapMap Project, it has become feasible to explore subtle 
genetic influences on many common disease conditions such as diabetes, asthma, migraine, schizophrenia, etc. 
Although some causal links have been made between genomic sequence variants in particular genes and some of 
these diseases, often with much publicity in the general media, these are usually not considered to be genetic 
disorders per se as their causes are complex, involving many different genetic and environmental factors. Thus there 
may be disagreement in particular cases whether a specific medical condition should be termed a genetic disorder. 

Evolution 

Comparative genomics studies of mammalian genomes suggest that approximately 5% of the human genome has 
been conserved by evolution since the divergence of those species approximately 200 million years ago, containing 
the vast majority of genes. [10] Intriguingly, since genes and known regulatory sequences probably comprise less 
than 2% of the genome, this suggests that there may be more unknown functional sequence than known functional 
sequence. A smaller, yet large, fraction of human genes seem to be shared among most known vertebrates. The 
chimpanzee genome is variously reported to be 94%-98.5% identical to the human genome in scientific literature. [26] 
On average, a typical human protein-coding gene differs from its chimpanzee ortholog by only two amino acid 
substitutions; nearly one third of human genes have exactly the same protein translation as their chimpanzee 
orthologs. A major difference between the two genomes is human chromosome 2, which is equivalent to a fusion 
product of chimpanzee chromosomes 12 and 13 (later renamed to chromosomes 2A and 2B, respectively). 

Humans have undergone an extraordinary loss of olfactory receptor genes during our recent evolution, which 
explains our relatively crude sense of smell compared to most other mammals. Evolutionary evidence suggests that 
the emergence of color vision in humans and several other primate species has diminished the need for the sense of 
smell. [29] 

Mitochondrial genome 

The human mitochondrial genome, while usually not included when referring to the "human genome", is of 
tremendous interest to geneticists, since it undoubtedly plays a role in mitochondrial disease. It also sheds light on 
human evolution; for example, analysis of variation in the human mitochondrial genome has led to the postulation of 
a recent common ancestor for all humans on the maternal line of descent (see Mitochondrial Eve). 

Due to the lack of a system for checking for copying errors, mitochondrial DNA (mtDNA) has a more rapid rate of 
variation than nuclear DNA. This 20-fold higher mutation rate allows mtDNA to be used for more accurate 
tracing of maternal ancestry. Studies of mtDNA in populations have allowed ancient migration paths to be traced, 
such as the migration of Native Americans from Siberia or Polynesians from southeastern Asia. It has also been used 
to show that there is no trace of Neanderthal DNA in the European gene mixture inherited through purely maternal 
lineage. [30] 

Epigenome 

Epigenetic features are a variety of features of the human genome that transcend its primary DNA sequence, such as 
chromatin packaging, histone modifications and DNA methylation, and which are important in regulating gene 
expression, genome replication and other cellular processes. They are marks on the DNA, or on the proteins that 
package it, that can be influenced within an individual's own life. Some studies suggest, for example, that years of 
over-eating can change a person's epigenetic marks in such a way that children, grandchildren, and possibly later 
generations are predisposed to a shorter and unhealthier life. Drugs that can change epigenetic marks are being 
researched, with the aim of being able to "shut off" certain diseases and disorders. Overall, epigenetic marks 
strengthen or weaken the expression of certain genes but are not part of the DNA sequence itself. [31] [32] 

See also 

• Craig Venter: Individual human genome sequenced 

• Eugenics 

• Eukaryotic chromosome fine structure 

• Genetic distance 

• Genomic organization 

• Human genetic engineering 

• Noncoding DNA 

• The Genographic Project 

• Y-chromosomal Adam 